Adding an unsupported language

Thanks, I think I’m almost there. I have to steal my HTPC back from the kids :slight_smile:

Got pretty far, then error about CUDA (and a number of errors after that, but one at a time)

My findings so far regarding argos-train-init:

  • it has to be a Debian-based OS, or at least one that uses the apt package manager
  • it assumes python virtualenv is installed. This must vary by distribution; I had to install it manually
  • torch seems to be flaky, even on just the LibreTranslate install. I manually installed the latest (1.11 or so), the init script uninstalled it and installed 1.9.x, but argos-train complained about the torch version. It was fine after I manually re-installed 1.11

Does it matter which to/from direction of the OPUS data I use? Can I just switch source and target?

Since I’m getting close on this da->en, I figured I’d prepare a no->en as well. There’s a CCMatrix of en->no. I assume I can still use that and just switch which file is source and which is target?


  • The scripts use apt-get to install dependencies, so you will need to either use a Debian-based distro or install them manually.
  • The CUDA uninstall and reinstall is trying to fix this issue. I think it’s specific to vast.ai, so if you’re using your own GPU just try to match the CUDA version used by torch to what you have installed (see the snippet after this list).
  • I’ve generally followed the convention of English as the source language for data packages, and Argos Train will automatically swap source and target if you’re training in the other direction. This convention isn’t necessary though.
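A quick way to see which CUDA version your installed torch build expects (just a sanity check, not part of the training scripts):

import torch

# Check the installed torch build and the CUDA version it was built against,
# so you can match it to what is installed on your own GPU machine.
print(torch.__version__)
print(torch.version.cuda)          # CUDA version torch was built with (None for CPU-only builds)
print(torch.cuda.is_available())   # whether torch can actually see a GPU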

It sounds like you’re making good progress, let me know if you have more questions. Also once you’ve finished with data packages please make a pull request to Argos Train so we can add links to the packages you’ve mirrored and created.

Will do.
I have them up there:
https://libretranslate.fortytwo-it.com/argosdata/data-europarl-da_en.argosdata
https://libretranslate.fortytwo-it.com/argosdata/data-ccmatrix-no_en.argosdata

I think they are good, but since I haven’t yet successfully completed a training they’re technically still work-in-progress :slight_smile:

Are you saying, taking the da-en one for example, I can just change what I say when I run argos-train?

So in this case da is the source file and en is the target file. But if I run argos-train and say source is en and target is da it will automatically switch which file it uses and I can train both da->en and en->da models using the same argosdata file?

For my own needs, but also to help out, I want to translate Danish to English. (I saw a few other requests for Danish in that one thread on GitHub.) I’d do an English to Danish as well afterward anyway, but if it’s that easy to do both from the same data…

So in this case da is the source file and en is the target file. But if I run argos-train and say source is en and target is da it will automatically switch which file it uses and I can train both da->en and en->da models using the same argosdata file?

Correct, here is the code doing that.
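Roughly, the idea is something like this (a simplified sketch, not the exact Argos Train code; I’m assuming the package’s two text files are called source and target here):

# Simplified sketch: a data package stores one direction (e.g. da -> en). If the
# requested training direction is the reverse of the package direction, swap
# which file is read as source and which as target.
def resolve_data_files(package_from, package_to, train_from, train_to):
    source_file, target_file = "source", "target"
    if (train_from, train_to) == (package_to, package_from):
        source_file, target_file = target_file, source_file
    return source_file, target_file

# e.g. a da_en package used to train en -> da:
print(resolve_data_files("da", "en", "en", "da"))  # ('target', 'source')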

I have a translate-da_en-1_0.argosmodel (and en_da-1_0 training now). How can I add this to my local libretranslate installation?

I found and did this:

#!/usr/bin/python3

from argostranslate import package, translate
package.install_from_path('translate-da_en-1_0.argosmodel')
installed_languages = translate.get_installed_languages()

for installed_language in installed_languages:
    print(installed_language)

And I do see Danish in the output. I restarted my libretranslate server.

print_r( $translator->Languages() )

however, does not show a [da] => Danish entry.
I figured I should try it out before submitting it for inclusion.


Like you said, you can install packages manually with:

import argostranslate.package

argostranslate.package.install_from_path('translate-da_en-1_0.argosmodel')

You should then be able to see the package installed with:

$ argospm list
translate-da_en

I think restarting LibreTranslate should then make it available; this is how we get installed languages in LibreTranslate.

You can also host your own version of the package index with your package and connect to it from LibreTranslate with the environment variable ARGOS_PACKAGE_INDEX=https://yourindex.com/index.json.
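If you go that route, I believe Argos Translate itself reads the same variable, so something like this should let you check that your index is being picked up (it has to be set before the import, since settings are read at import time):

import os

# Point Argos Translate at a self-hosted package index (set before importing).
os.environ["ARGOS_PACKAGE_INDEX"] = "https://yourindex.com/index.json"

import argostranslate.package

argostranslate.package.update_package_index()        # download the index
for pkg in argostranslate.package.get_available_packages():
    print(pkg.from_code, "->", pkg.to_code)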

Oh, I ran my Python code as root thinking it would install the models “system-wide”. I had to run it as the libretranslate user; now the webpage shows Danish and Norwegian options.

Do models stack when installed?

The Danish translation tests with just the Europarl model were a bit funky. But Norwegian translation tests with CCMatrix data seem pretty decent.

I was going to do CCMatrix for Danish then, but argos-train said the data was too large. Can I increase that limit? (I didn’t see anything obvious in config.yml…)

So I did a CCAligned version instead, installed it, and the translations on the web page seem a lot better. Is it now using both the Europarl and the CCAligned data? If I train other da_en data sources, will that add more information for the translations?


The model packages are installed to $HOME/.local, so they are per-user.

All your installed Argos Translate models should be able to pivot through English to translate to different languages.
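For example, once your da->en and en->no packages are both installed, something like this should pivot Danish to Norwegian through English (using the normal argostranslate API):

import argostranslate.translate

installed_languages = argostranslate.translate.get_installed_languages()
danish = next(lang for lang in installed_languages if lang.code == "da")
norwegian = next(lang for lang in installed_languages if lang.code == "no")

# With da->en and en->no installed, this translation pivots through English.
translation = danish.get_translation(norwegian)
print(translation.translate("Hej, hvordan går det?"))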

The max data size is configurable in bin/argos-train.

Generally the more data you use the better; the max data setting is there to exclude very large datasets (normally CCMatrix), which are possibly lower quality.

Thanks. I have been creating the en_da and en_no models as well.

What I was checking on was: is my argostranslate now using both the Europarl and the CCAligned models for Danish, or just the last model installed? Would adding a MediaWiki model add to those or replace the previous one?

If it increases the data used, I may train more to increase accuracy. If it only uses one, then I guess I’ve completed my Danish and Norwegian and can start using them on the page content of my clients’ sites.

It got me wondering because I noticed the current data-index.json has multiple datasets for en_es, and I wondered why, if it would only use one of the models…


When you train a .argosmodel package Argos Train will attempt to use all of the data for that language pair. If you install multiple packages for the same language pair only one will be used and it won’t combine their data.
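You can check what is installed for each pair with the regular API, for example:

import argostranslate.package

# List every installed package and its language pair; if two packages cover the
# same pair, only one of them is actually used for translation.
for pkg in argostranslate.package.get_installed_packages():
    print(f"{pkg.from_code} -> {pkg.to_code}: {pkg.from_name} to {pkg.to_name}")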

Then why are there multiple argosdata files for en_es (and other combos)?

Oh… am I doing this backwards (maybe sideways)?

Instead of training a europarl-en_da by itself, and then a ccaligned-en_da by itself, I should put both in the argos-train JSON file and the training will use both of them to create one argosmodel?


Correct, you should put links to all of the data you want to use for a language pair in data-index.json and it will be used to train one model.
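For your Danish example that would mean two entries, one per dataset. Roughly like this (a sketch only, with the field names from memory, so double-check them against the existing entries in data-index.json before opening a pull request; the second URL is a placeholder):

import json

# Hypothetical sketch of the two Danish entries for data-index.json.
entries = [
    {
        "name": "europarl-da_en",
        "type": "data",
        "from_code": "da",
        "to_code": "en",
        "links": ["https://libretranslate.fortytwo-it.com/argosdata/data-europarl-da_en.argosdata"],
    },
    {
        "name": "ccaligned-da_en",
        "type": "data",
        "from_code": "da",
        "to_code": "en",
        "links": ["https://example.com/argosdata/data-ccaligned-da_en.argosdata"],
    },
]
print(json.dumps(entries, indent=4))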


Do I need a “bigger” server? I increased the MAX_DATA_SIZE so it is processing my larger CCMatrix data file too, but now the training just keeps getting Killed, and I can’t find anything specific in the logs, df, or top as to why.

(env) argosopentech@b8f995ca3e6e:~/argos-train$ argos-train
From code (ISO 639): en
To code (ISO 639): da
From name: English
To name: Danish
Version: 1.3
ccmatrix-en_da
paracrawl-en_da
ccaligned-en_da
europarl-en_da
wikimatrix-en_da
Read data from file
Killed
(env) argosopentech@b8f995ca3e6e:~/argos-train$

“Killed” normally means you ran out of RAM; you can try adding swap space.

The max data size config is to prevent this problem since the data is loaded into memory during training. I’ve had good results just excluding the largest datasets since they’re likely lower quality and cause problems.

I did some experiments preprocessing CCMatrix and other large datasets across multiple servers, but ended up with worse results. Sometimes the max data size can exclude OpenSubtitles though, which is a very high-quality dataset in my experience.
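If you want a rough sanity check before kicking off a long run, something like this (just a back-of-the-envelope helper, not part of argos-train) compares the size of the data files you plan to use with the machine’s physical RAM:

import os
import sys

# Back-of-the-envelope check (Linux only): combined size of the given data files
# versus physical RAM, since the data is loaded into memory during training.
total_bytes = sum(os.path.getsize(path) for path in sys.argv[1:])
ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
print(f"data: {total_bytes / 1e9:.1f} GB, RAM: {ram_bytes / 1e9:.1f} GB")
if total_bytes > ram_bytes / 2:
    print("Likely too big to load comfortably; drop the largest dataset or add swap.")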


I get this sometimes:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 67: invalid continuation byte

Is there a problem in the data from OPUS directly? Or is it the chain of download from OPUS, adjust, zip, rename, upload to place A, training then downloads to place B… with the data somehow getting slightly mangled in between those various transfers?


I haven’t seen that before, my best guess is a bad character encoding somewhere in the data.
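If you want to track down where it is, a quick scan like this (just a throwaway helper) will print any lines in the extracted source/target files that aren’t valid UTF-8:

import sys

# Report every line that is not valid UTF-8 in the files passed on the command line.
for path in sys.argv[1:]:
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                print(f"{path}:{lineno}: {err}")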

baby character encoding? Not familiar with that term :slight_smile:

Got this with a few data sources from OPUS for Romanian (CCAligned and OpenSubtitles, I want to say), but also with CCMatrix for Norwegian.

Not sure which component is raising it: argos-train, SentencePiece, CTranslate2, etc…

And I know a downloaded zip could get slightly mangled but still be extractable (I usually do test the zip). Then I extract, change some things, re-zip, and upload. Then train downloads it, extracts…

I assume it’s GIGO of some sort. I just don’t know if it starts with the OPUS data itself, or one of the steps between downloading from OPUS and running train. I’m going to try redoing Romanian from scratch, maybe train with just one OPUS data source first, then add a second if that works, then a third, etc.


Woops, *bad character encoding.

I would like to train a model pair of en-ro and ro-en, but since vast.ai is not free, I wanted to ask how many hours you estimate it would take.

It takes around 8 hours to train a model on an RTX 3090.
