New argos model en_ru and ru_en

If you have dictionary data I would definitely recommend including it.

I’ve used this script before to collect Wiktionary data:

1 Like

Thanks for the script, it will be very useful!
I will try to add more data and a corpus of dictionaries/phrases, refine the current models, and see how it affects the translation of individual words and of text in general. Very interesting!
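For illustration, turning a dictionary into extra training pairs boils down to something like this (a minimal sketch; the TSV file name, column layout, and output paths are assumptions rather than the actual pipeline):

```python
# Minimal sketch: turn a bilingual dictionary (TSV: "word<TAB>translation")
# into extra parallel lines appended to the training corpus.
# File names and format are assumptions for illustration only.

def append_dictionary(dict_tsv, src_out, tgt_out):
    with open(dict_tsv, encoding="utf-8") as f, \
         open(src_out, "a", encoding="utf-8") as src, \
         open(tgt_out, "a", encoding="utf-8") as tgt:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue  # skip malformed entries
            word, translation = (p.strip() for p in parts)
            if word and translation:
                src.write(word + "\n")
                tgt.write(translation + "\n")

# Example: append_dictionary("wiktionary_en_ru.tsv", "corpus.en", "corpus.ru")
```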

1 Like

I’m fine uploading slightly larger models to the package index. 210M parameters is fine. I’m guessing this model would still be under 200MB compressed.

That’s a great result! We should experiment with using Hidden1024 and FeedForward4096 with other languages. They might be a better default config than what we currently have.

Yes, I can confirm that the finished package takes about 200 MB.
In that case I can update the dataset (to improve the translation of individual words) and train the large models right away.
I should be able to post them, together with the training configuration, for testing by January 15.

2 Likes

I just finished writing this tool, GitHub - LibreTranslate/RemoveDup: Remove duplicates from parallel corpora, which can address the time/memory consumption issue for removing duplicates.
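Conceptually, deduplicating a parallel corpus comes down to something like the sketch below (an illustration of the idea only, not RemoveDup's actual implementation; file names are placeholders):

```python
# Minimal sketch of parallel-corpus deduplication (not the RemoveDup implementation).
# Assumes two aligned plain-text files, one sentence per line; file names are placeholders.
import hashlib

def dedup(src_path, tgt_path, out_src, out_tgt):
    seen = set()
    kept = 0
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft, \
         open(out_src, "w", encoding="utf-8") as os_, \
         open(out_tgt, "w", encoding="utf-8") as ot:
        for src, tgt in zip(fs, ft):
            # Hash the pair instead of storing full strings to keep memory usage low.
            key = hashlib.md5((src.strip() + "\t" + tgt.strip()).encode("utf-8")).digest()
            if key in seen:
                continue
            seen.add(key)
            os_.write(src)
            ot.write(tgt)
            kept += 1
    return kept

# Example: dedup("corpus.en", "corpus.ru", "corpus.dedup.en", "corpus.dedup.ru")
```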

2 Likes

I am posting a large EN_RU model with 210M parameters.
After adding the dictionary corpus, it now translates a number of words and expressions correctly.
The translation overall has also become more accurate, and the number of small but very distracting errors has decreased, especially on news and scientific articles.

BLEU score: 67.91297.

As I promised earlier, the RU_EN model will be posted by January 15; its training should be finished by then.

translate-en_ru-2_2.argosmodel

2 Likes

For some reason all the previous files were deleted from the hosting.
If you can tell me the best way to upload files to the forum, I would be very grateful!

I personally like Dropbox or Google Drive for uploading large files.

2 Likes

I’ve also just added the ability to perform filtering and transformations directly from Locomotive :boom: based on the ideas from this thread.

This should make it easier to clean up data sources.

2 Likes

Great job! This will make filtering the corpora so much easier, thanks!

I would also like to warn you to be careful with the nonalphanum_count_mismatch and uppercase_count_mismatch filters.

For example (the translation is correct, but will fall under both filters):

source: It’s fun, I think. (2 uppercase, 3 non-alphanumeric)

target: Это весело, я так думаю. (1 uppercase, 2 non-alphanumeric)

I had to retrain the models twice before I realized that these filter conditions should be excluded, otherwise the translation was unnatural.
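For reference, such a filter boils down to roughly the following (a sketch of the idea only, not Locomotive's exact code); with zero tolerance it rejects the perfectly good pair above:

```python
# Minimal sketch of count-mismatch filters (idea only, not Locomotive's exact code).
# A pair is rejected when the counts differ by more than the allowed tolerance.

def uppercase_count(s):
    return sum(1 for ch in s if ch.isupper())

def nonalphanum_count(s):
    return sum(1 for ch in s if not ch.isalnum() and not ch.isspace())

def count_mismatch(src, tgt, tolerance=0):
    return (abs(uppercase_count(src) - uppercase_count(tgt)) > tolerance or
            abs(nonalphanum_count(src) - nonalphanum_count(tgt)) > tolerance)

src = "It’s fun, I think."
tgt = "Это весело, я так думаю."
print(count_mismatch(src, tgt))  # True: a perfectly correct pair gets filtered out
```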

2 Likes

I’m done with training the RU-EN model.
It took 3 attempts and about 22 days, taking into account the issues discovered along the way with the dataset and with the model’s hyperparameters.
The results are in the table below:

EN_to_RU models:

| Model | BLEU | COMET-22 | Model size | PPL | Explanations and Notes |
| --- | --- | --- | --- | --- | --- |
| 1.7 | 28.55 | 0.8480 | 62M | - | Current model in index. |
| 2.1 | 29.73 | 0.8608 | 62M | 12 | Dataset filtering and large effective batch size were applied. Doesn’t translate individual words. |
| 2.2 | 32.97 | 0.8835 | 209M | 9.6 | Big model. A dictionary has been added that can translate individual words. |
| GoogleTranslate | 35.52 | 0.9060 | - | - | |

RU_to_EN models:

| Model | BLEU | COMET-22 | Model size | PPL | Explanations and Notes |
| --- | --- | --- | --- | --- | --- |
| 1.0 | 27.75 | 0.8038 | 62M | - | Current model in index. |
| 1.2 | 35.15 | 0.8545 | 62M | 11.82 | Dataset filtering and large effective batch size were applied. Doesn’t translate individual words. |
| 1.3a | 36.63 | 0.8598 | 209M | 10.69 | Big model. A dictionary has been added that can translate individual words. |
| 1.3 | 38.31 | 0.8645 | 159M | 10.51 | DEEP model. A dictionary has been added that can translate individual words. |
| GoogleTranslate | 41.56 | 0.8752 | - | - | |

All tests were performed on FLORES200 validation data.
According to the OPUS-MT Dashboard, both of my models are ahead of the OPUS-MT and facebook/wmt19 models on FLORES200 (OPUS-MT rus-eng and OPUS-MT eng-rus).

However, GoogleTranslate is still some way ahead; so far I have tried to keep the size of the models reasonable while at least approaching its quality.
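For anyone who wants to reproduce the scoring, a rough sketch is below (file paths are placeholders, and the COMET checkpoint name assumes the standard Unbabel/wmt22-comet-da model):

```python
# Rough sketch of scoring a system's output against FLORES200 references.
# Paths are placeholders; each file is plain text, one sentence per line.
import sacrebleu
from comet import download_model, load_from_checkpoint

hyps = open("flores200.hyp.en", encoding="utf-8").read().splitlines()
refs = open("flores200.ref.en", encoding="utf-8").read().splitlines()
srcs = open("flores200.src.ru", encoding="utf-8").read().splitlines()

# Corpus-level BLEU with sacrebleu's default tokenization.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print("BLEU:", bleu.score)

# COMET-22 (reference-based); assumes the Unbabel/wmt22-comet-da checkpoint.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print("COMET-22:", model.predict(data, batch_size=16, gpus=1).system_score)  # gpus=0 for CPU
```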

I propose to include these models in the index (en-ru v2.2 and ru-en v1.3):

translate-en_ru-2_2.argosmodel

translate-ru_en-1_3.argosmodel

3 Likes

That’s awesome! Thanks for sharing these.

2 Likes

Awesome! Thanks for this contribution

I just uploaded this model to argospm-index. It should be live now for new LibreTranslate installations.

Also available for download here:

3 Likes

@lynxpda Those models you contributed are really awesome!

Also, I used a custom evaluation method that measures how much value your models add for a professional translator who edits the translated texts before official publication: the improvement is in the +20% bracket for RU-EN and +10% for EN-RU, a lot more than the BLEU scores suggest.
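(For reference, a crude way to quantify this kind of post-editing effort is a character-level similarity ratio between the raw MT output and the edited text; the sketch below is only an illustration, not my actual evaluation script.)

```python
# Illustrative only: approximate post-editing effort by comparing raw MT output
# with the translator's edited version (0.0 = nothing changed, 1.0 = fully rewritten).
from difflib import SequenceMatcher

def edit_effort(mt_line, edited_line):
    return 1.0 - SequenceMatcher(None, mt_line, edited_line).ratio()

mt = "The cat sat on the mat."
edited = "The cat was sitting on the mat."
print(f"{edit_effort(mt, edited):.2f}")
```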

With your permission, I am looking forward to using the research you published to try to improve the German<->English models.

Then, when reading the blueprints, I noticed that some translations actually got worse: for instance, “экономики с полной занятостью” is translated as “full-time economy” instead of “economy with full employment”, as was previously (and correctly) the case.

@pierotofy @argosopentech
Do you have any idea how to fine-tune the models so as to correct these hallucinations?

Although there is a “suggestions” option in LibreTranslate, I have not found out how to use the suggestions.db once it has been created.

2 Likes

@NicoLe First of all, I would like to say thank you very much for your appreciation of the quality of the models.

Due to the peculiarities of NMT it is very difficult to completely exclude translation inaccuracies:

  1. The translation is rather statistical in nature:
    • Most likely, the old model was trained on data containing a large share of the desired domain (economy, politics).
    • The new model was trained on a much larger amount of data from different domains (about 150M sentence pairs in total).
    • Even Google and DeepL offer the ability to adapt the translator to a specific domain, for example:

AutoML Translation beginner’s guide

The Translation API covers a huge number of language pairs and does a great job with general-purpose text. Where AutoML Translation really shines is for the “last mile” between generic translation tasks and specific, niche vocabularies. Our custom models start from the generic Translation API model, but add a layer that specifically helps the model get the right translation for domain-specific content that matters to you.

  2. Translation is very context-dependent; document-level NMT models, for example, will more often produce a coherent translation than sentence-level models, as in our case.

In my case, the goal was to get a model that would, on average, produce acceptable results on most domains, while maximizing Quality/Size.

In reality, we have to find a compromise, since quality does not improve linearly with size, while also trying to avoid biasing the training data toward any particular domain/area.

If you want to customize the models for yourself, I am happy to share the checkpoints and training configuration files.

You can continue training by adding your own data (be careful here: you need to avoid catastrophic forgetting; I can share the corpus itself, but its total size is about 20 GB).
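As a rough illustration, continuing training from a checkpoint with OpenNMT-py looks something like this (the paths, config name, and step number are placeholders):

```python
# Rough sketch: resume/fine-tune an OpenNMT-py model from a shared checkpoint.
# Paths, config name and step number are placeholders.
import subprocess

subprocess.run(
    [
        "onmt_train",
        "-config", "config.yaml",               # the shared training configuration
        "-train_from", "model_step_200000.pt",  # the shared checkpoint to continue from
    ],
    check=True,
)
```

Mixing a slice of the original corpus in with your new domain data is the usual way to guard against the catastrophic forgetting mentioned above.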

You can also try training a LoRA adapter and merging the resulting weights with the main model (as far as I know, OpenNMT-py supports this functionality, but I haven’t had a chance to use it yet).

P.S. For my own purposes of translating scientific and technical documentation, I have so far decided to go the following way: I collected a domain-specific corpus of 42M sentences and built a synthetic corpus using back-translation in order to train a very large model (more than 400M parameters).

As a result, I hope to significantly outperform DeepL in translation quality (yep, ambitious), at least in this particular area. For professional purposes I think that is justified, though not for fast general-purpose translation models.

2 Likes

Perhaps this will help: I am attaching a model weight calculator in which I wrote down and marked the model options I find most interesting (the base model is highlighted in dark green).

transformer based model parameters calculator

When I get the chance, I’ll post the rationale for this selection of parameters; it’s mostly a compilation of my findings from reading the preprints.
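For those who don’t want to open the spreadsheet, the estimate itself is simple arithmetic. Here is a rough sketch (the layer counts, vocabulary size, and weight-sharing settings below are example assumptions, and small terms such as biases and layer norms are ignored):

```python
# Rough estimate of parameter count for a standard Transformer NMT model.
# Ignores biases, layer norms and positional encodings; exact totals depend on
# implementation details such as embedding sharing.

def transformer_params(d_model, d_ff, enc_layers, dec_layers, vocab,
                       shared_embeddings=True, tied_generator=True):
    attn = 4 * d_model * d_model              # Q, K, V and output projections
    ffn = 2 * d_model * d_ff                  # the two feed-forward projections
    enc_layer = attn + ffn
    dec_layer = 2 * attn + ffn                # self-attention + cross-attention
    emb = vocab * d_model * (1 if shared_embeddings else 2)
    gen = 0 if tied_generator else vocab * d_model   # output softmax projection
    return emb + gen + enc_layers * enc_layer + dec_layers * dec_layer

# Example with assumed 6+6 layers, 32k shared vocab, hidden 1024, FF 4096:
print(transformer_params(1024, 4096, 6, 6, 32000))  # ~209M with these assumed settings
```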

1 Like

Well, the little quirk I wrote about notwithstanding, it’s mission accomplished for your en-ru models and ru-en too.
I will first try to understand your approach, as I am not so familiar with data science.

1 Like

Speaking of that quirk, I think I have a clue as to what it might be related to and how to fix it (by retraining from scratch).

In the data I used a fairly large percentage of extracted CCMatrix bitexts and synthetic back-translation corpora.

I recently came across research suggesting that such corpora be labeled with a special token, e.g. src_prefix = <BT>.

This should help to separate the data during training and make it clear to the model where the real texts are and where the synthetic ones are.
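The tagging itself is trivial; conceptually it is just the following (a minimal sketch, not my actual train.py changes; file names are placeholders, and the tag also has to be added to the vocabulary as a dedicated token):

```python
# Minimal sketch of tagged back-translation: prepend a <BT> token to the *source*
# side of synthetic (back-translated) pairs so the model can tell them apart from
# genuine bitext during training. File names are placeholders.

BT_TAG = "<BT>"

def tag_back_translated(src_in, src_out):
    with open(src_in, encoding="utf-8") as fin, \
         open(src_out, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(f"{BT_TAG} {line.lstrip()}")

# Genuine corpora stay untagged; only the synthetic source side gets the prefix.
# tag_back_translated("backtranslated.src", "backtranslated.tagged.src")
```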

I have slightly modified Locomotive/train.py and am now trying to train a Kabyle language model. I can say that it works and makes it possible to use transfer learning and to train multilingual models.

Here are the links to the preprints that guided me:

Tagged Back-Translation

Tagged Back-translation Revisited: Why Does It Really Work?

Could you please send again the script archive you mentioned in the issue on Dec 26?

I’m attaching the archive again. As is.

filtertool.zip

In general, almost all of this functionality is already implemented in Locomotive.
If you have any additional questions, you can also write me a private message and I will try to answer.

1 Like