Adding a non-supported language

“Killed” normally means you ran out of RAM. You can try adding swap space.

The max data size config exists to prevent this problem, since the data is loaded into memory during training. I’ve had good results just excluding the largest datasets, since they’re likely lower quality and cause problems.

I did some experiments using multiple servers to preprocess CCMatrix and other large datasets, but ended up with worse results. Sometimes the max data size can exclude OpenSubtitles, though, which is a very high quality dataset in my experience.


I get this sometimes:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 67: invalid continuation byte

Is there a problem in the data from OPUS directly? Or is it that I download from OPUS, adjust, zip, rename, upload to place A, then training downloads it to place B… and the data somehow gets slightly mangled somewhere between those transfers?


I haven’t seen that before, my best guess is a bad character encoding somewhere in the data.
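One way to narrow it down would be to scan the extracted text files for lines that aren't valid UTF-8 before training. This is only a rough sketch: the function name `find_invalid_utf8_lines` and the `limit` parameter are made up for illustration, and it assumes the data is one sentence per line in a plain text file.

```python
def find_invalid_utf8_lines(path, limit=10):
    """Return (line_number, byte_offset_in_line, raw_bytes) for lines
    that cannot be decoded as UTF-8, stopping after `limit` hits."""
    bad = []
    # Read in binary mode so Python does not try to decode for us.
    with open(path, "rb") as f:
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the offset of the offending byte in this line.
                bad.append((line_number, err.start, raw))
                if len(bad) >= limit:
                    break
    return bad
```

If this returns hits on the files straight from OPUS, the problem is in the source data; if the originals are clean, it was introduced in a later step.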

baby character encoding? Not familiar with that term :slight_smile:

Got this with a few data sources from OPUS for Romanian. Want to say CCAligned and OpenSubtitles. But also with CCMatrix for Norwegian.

Not sure who’s saying it - argos-train, sentencepiece, CTranslate2, etc…

And I know a downloaded zip could get slightly mangled but still be extractable (I do usually test the zip). Then I extract, change some things, re-zip, upload. Then train downloads it, extracts…

I assume it’s GIGO of some sort. Just don’t know if it starts with the OPUS data itself, or any of the steps between download OPUS and running train. Going to try redoing Romanian from scratch, maybe just do one OPUS source data train, then add a second if that works, then add a third, etc.
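To rule out the transfers themselves, comparing a checksum of the zip at each stop (after re-zipping, after upload, after the training box downloads it) should show whether the bytes change in between. A minimal sketch using only the standard library; `sha256_of_file` is just an illustrative helper name:

```python
import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks so
    large archives do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the digest matches at place A and place B, the transfer is fine and the bad bytes were already in the archive.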


Woops, *bad character encoding.

I would like to train a model pair, en-ro and ro-en, but since vast.ai is not free, I wanted to ask how many hours you estimate it would take.

It takes around 8 hours to train a model on an RTX 3090.
