“Killed” normally means you ran out of RAM; you can try adding swap space.
The max data size config exists to prevent this problem, since the data is loaded into memory during training. I’ve had good results just excluding the largest datasets, since they’re often lower quality anyway and tend to cause problems.
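This isn’t the actual argos-train mechanism, just a rough sketch of how you could skip the biggest data packages before training; the directory path and size cutoff below are made up for illustration:

```python
from pathlib import Path

# Hypothetical location of the downloaded OPUS data packages (zip files).
DATA_DIR = Path("data")
# Made-up cutoff: skip anything over ~2 GiB since it won't fit comfortably in RAM.
MAX_BYTES = 2 * 1024**3

def select_packages(data_dir: Path, max_bytes: int) -> list[Path]:
    """Return the data packages small enough to load into memory."""
    keep, skip = [], []
    for zip_path in sorted(data_dir.glob("*.zip")):
        (keep if zip_path.stat().st_size <= max_bytes else skip).append(zip_path)
    for zip_path in skip:
        print(f"Skipping {zip_path.name} ({zip_path.stat().st_size / 1024**3:.1f} GiB)")
    return keep

if __name__ == "__main__":
    for pkg in select_packages(DATA_DIR, MAX_BYTES):
        print("Will train on:", pkg.name)
```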
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 67: invalid continuation byte
Is there a problem in the data from OPUS itself? Or is it the pipeline: download from OPUS, adjust, make a zip, rename it, upload to place A, then training downloads it to place B… and the data is somehow getting slightly mangled in one of those transfers?
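One way to narrow it down is to check whether the raw OPUS text already contains invalid UTF-8 before any of the re-zip/upload steps. A quick sketch (the file name is just a placeholder for whichever extracted OPUS file you want to check):

```python
def find_bad_utf8_lines(path: str, limit: int = 10) -> None:
    """Report the first few lines in a file that aren't valid UTF-8."""
    found = 0
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                print(f"line {lineno}: {err}")
                # Show the raw bytes around the offending position.
                print(f"  context: {raw[max(err.start - 10, 0):err.start + 10]!r}")
                found += 1
                if found >= limit:
                    break
    if found == 0:
        print("No invalid UTF-8 found.")

# Placeholder path: point this at the extracted OPUS file before re-zipping it.
find_bad_utf8_lines("CCAligned.en-ro.ro")
```

If the freshly downloaded OPUS file already fails this check, the problem starts at the source rather than in the transfers.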
“baby character encoding”? Not familiar with that term.
Got this with a few data sources from OPUS for Romanian. I want to say CCAligned and OpenSubtitles. But also with CCMatrix for Norwegian.
Not sure which component is raising it - argos-train, SentencePiece, CTranslate2, etc…
And I know a downloaded zip could get slightly mangled but still be extractable (I do usually run a test on the zip). Then extract, change some things, re-zip, upload. Then training downloads it and extracts…
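If you want to rule out transfer corruption, one generic option (not anything built into argos-train) is to hash the zip before uploading and again after the training machine downloads it; matching digests mean the transfers didn’t touch the bytes. The file name here is a placeholder:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Run this on the machine that uploads and again on the machine that trains;
# the placeholder name stands in for the actual data package.
print(sha256_of("data-ccmatrix-en_ro.zip"))
```

For the archive itself, `zipfile.ZipFile(path).testzip()` in the standard library does roughly what a manual zip test does, returning the name of the first corrupt member or None if everything checks out.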
I assume it’s GIGO of some sort. I just don’t know whether it starts with the OPUS data itself or in one of the steps between downloading from OPUS and running the training. Going to try redoing Romanian from scratch: maybe train on just one OPUS source first, then add a second if that works, then a third, etc.