All Indian Languages

We are looking for all Indian languages:

Kannada, Telugu, Tamil, Malayalam, Marathi, Gujarati, Odia, Kashmiri, etc.

Thanks and Regards,
Srikanth

I advise you to train your own models using Locomotive. You may find the sources on GitHub, and training data on OPUS (I’ll be updating the language codes soon; they are listed in data.py).

1 Like

Hey @NicoLe, I’m interested in giving model training a try. The Locomotive guide was pretty straightforward, and I tried converting the OPUS models. They gave me scores of less than 20, even though their README lists scores near the 90s?
python opus_mt_convert.py -s en -t ta
downloaded from OPUS-MT-models/en-ta/opus-2019-12-04.zip

python eval.py --config run/en_ta-opus_1.9/config.json
BLEU score: 3.39328
I also find it a bit weird that it ran a BLEU evaluation even though I didn’t request one. I wasn’t able to do a trial run of the model.

Any pointers on how to proceed? Should I try training from scratch?

1 Like

BLEU is the default metric of the eval script: it will always report a BLEU score, whether you asked for one or not.

Your post is not completely clear about which scores (BLEU, COMET-22, something else?) the OPUS models yield, and which they should yield.

Also, a 3.39 BLEU is not a good indicator of the model’s quality. It rather suggests that the Tamil script is not properly interpreted by the sacrebleu scorer (for Chinese, Korean and Japanese, even a good model gets a BLEU of 0) and that the small score comes from a few English inserts.
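
To see why a near-zero BLEU can be a tokenization artifact rather than a quality signal, here is a toy sketch (plain Python, deliberately not sacrebleu’s real implementation): with whitespace tokenization, a sentence written in a script without spaces becomes a single token, so nothing short of an exact match scores at all.

```python
def unigram_precision(hypothesis: str, reference: str) -> float:
    """Toy whitespace-tokenized unigram precision (not real BLEU)."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not hyp_tokens:
        return 0.0
    matches = sum(1 for t in hyp_tokens if t in ref_tokens)
    return matches / len(hyp_tokens)

# English: partial word overlap is rewarded.
print(unigram_precision("the cat sat", "the cat slept"))  # ≈ 0.67

# A sentence with no internal spaces is one single token, so a
# near-identical hypothesis still scores 0 unless it matches exactly.
print(unigram_precision("நல்வரவு", "நல்வரவு!"))  # 0.0
```

This is why any non-zero score on such a pair tends to come from stray Latin-script fragments (English inserts) rather than from the translation itself.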

I would run two evals with the arguments --comet and --flores_dataset dev or devtest to get:

  1. COMET scores (above 0.8 is generally not bad, under 0.7 is really bad);
  2. two sets of scores, to see whether the model is really out of whack or whether it’s an accident of the default evaluation dataset.
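
As a quick rule of thumb, those cutoffs can be written down as a tiny helper (the bands are my informal ones, not any official standard):

```python
def rate_comet(score: float) -> str:
    """Rough, informal quality bands for a COMET-22 score."""
    if score > 0.8:
        return "generally not bad"
    if score >= 0.7:
        return "borderline"
    return "really bad"

print(rate_comet(0.84))  # generally not bad
print(rate_comet(0.65))  # really bad
```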

As for training… first, you should identify good sources:

  • NLLB is the bread and butter; take the first 40% with the ‘top’ filter and it will yield the cream.
  • Samanantar, Anuvaad, pmindia are good
  • MultiHPLT needs the fast_lang and limit_latin_chars filters if you’re training from en to ta (otherwise you’ll get lots of English inserts), and only fast_lang from ta to en (but then you need to be able to manage English inserts).
  • XLEnt brings named entities. You can also manually download the “dic” file from OpenSubtitles, convert it to Moses format with a small script, and use it as a dictionary.
  • I usually include a bunch of other corpora with other filters for which you can use fast_lang as a shortcut (WikiMatrix, QED, TED2020, NeuLab-TedTalks, wikimedia, tico-19, Tatoeba, ELRC-wikipedia_health)
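
To give a feel for a few of the steps above, here is a rough sketch of the ‘top’ cut, a crude limit_latin_chars-style check, and the dic-to-Moses split. These are illustrative stand-ins, not Locomotive’s actual filters, and I’m assuming a plain source<TAB>target layout for the dic file — check what OPUS actually ships before relying on it:

```python
def top_fraction(lines, fraction=0.4):
    """Keep the first `fraction` of an already score-sorted corpus,
    like taking the first 40% of NLLB with a 'top'-style filter."""
    keep = int(len(lines) * fraction)
    return lines[:keep]

def limit_latin_chars(line, max_ratio=0.2):
    """Accept a target-side line only if at most `max_ratio` of its
    letters are Latin (a crude stand-in for limit_latin_chars)."""
    letters = [c for c in line if c.isalpha()]
    if not letters:
        return True
    latin = sum(1 for c in letters if c.isascii())
    return latin / len(letters) <= max_ratio

def dic_to_moses(dic_lines):
    """Split 'source<TAB>target' entries into two aligned lists
    (Moses format: one segment per line, same line numbers on
    both sides). Malformed lines are skipped."""
    src, tgt = [], []
    for line in dic_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            src.append(parts[0])
            tgt.append(parts[1])
    return src, tgt
```

In practice Locomotive applies its filters at corpus scale during data preparation; this only shows the shape of the operations.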

Over the course of the last two years, I’ve devised extra data processing which is not public and quite time-consuming (at least as long as the training itself) to get professionally useful quality, but if you get your sources right, you shouldn’t end up too far from it.

Then, depending on your GPU’s capability, choose an architecture for your model. Vanilla does not need much VRAM, but it does not yield very good models. The architectures I train need 25 to 40 GB of VRAM, so a gamer’s GPU may not be enough. If you have that oomph in your equipment, I’ll tell you what parameters work without too much tweaking.

2 Likes