How to fine-tune a model for the Azerbaijani language?

I checked the quality of the LibreTranslate English → Azerbaijani translation, and it is very poor. I want to first train the model on the NLLB dataset.
How many T4 GPU hours will that take, and how do I do it?

Training a model with Locomotive takes about two days per direction on a good (if not recent) GPU.

But know that there is no Stanza package for “az”, so if you start training az->en, you’ll get an error on the Stanza package (the Stanza model used in LT is a repurposed Turkish one; you can extract it from the existing package and put it in the running directory to fix the issue).

For en->az, it will pick up the English Stanza model and start training.

Also, you do not have to train a model on the whole of NLLB, because NLLB sentences are ranked by LASER score and the end of the corpus is usually not very good for training. Likewise, the beginning of the corpus may contain very short sentences that are not very useful for training.

At first, try to train on NLLB with the “excerpt” filter, top percentile = 10, bottom percentile = 90, and also filter very short and very long sentences with the “charlength” filter between 20 and 500.
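The combined effect of those two filters can be sketched in plain Python. This is a hypothetical stand-in for illustration only, not Locomotive’s actual implementation; `filter_corpus` and its parameter names are invented here, and the real filters are set in Locomotive’s training configuration:

```python
# Sketch of the filtering described above: given sentence pairs already
# ranked by LASER score (as in the NLLB corpus), keep only the middle
# percentile slice, then drop pairs with very short or very long sides.
def filter_corpus(pairs, top_percentile=10, bottom_percentile=90,
                  min_chars=20, max_chars=500):
    """pairs: list of (src, tgt) tuples, best-scored first."""
    n = len(pairs)
    start = n * top_percentile // 100     # skip the top slice (very short sentences)
    end = n * bottom_percentile // 100    # skip the low-score tail
    kept = []
    for src, tgt in pairs[start:end]:
        if min_chars <= len(src) <= max_chars and min_chars <= len(tgt) <= max_chars:
            kept.append((src, tgt))
    return kept
```

With top percentile 10 and bottom percentile 90, roughly the middle 80% of the ranked corpus survives before the length check is applied.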

If you are satisfied with the quality after that, then fine. Otherwise, check the posts where @lynxpda and I participate for further tips.
