New argos model en_ru and ru_en

Hello,

This is a new model for en-ru and ru-en.

I trained the model with LibreTranslate/Locomotive on 148M pairs of filtered sentences, with an effective batch size of 200k.
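An effective batch size like this is normally reached through gradient accumulation rather than a single huge batch. A minimal sketch of the arithmetic, assuming an OpenNMT-py-style token-based batch size (the concrete numbers below are illustrative, not the actual training config):

```python
# Hedged sketch: how an "effective batch size" of ~200k tokens can be reached
# via gradient accumulation. Numbers are illustrative, not the real config.

def effective_batch_size(batch_tokens: int, accum_count: int, num_gpus: int) -> int:
    """Tokens contributing to each optimizer step."""
    return batch_tokens * accum_count * num_gpus

# e.g. 8192 tokens per batch, accumulated over 25 steps on a single GPU:
print(effective_batch_size(8192, 25, 1))  # 204800 (~200k)
```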

The model was trained on cleaned text corpora from https://opus.nlpl.eu/: CCMatrix.en-ru, MultiUN.en-ru, News-Commentary.en-ru, OpenSubtitles.en-ru, QED.en-ru, Tatoeba.en-ru, TED2013.en-ru and TED2020.en-ru, as well as our own collected corpus of back-translated sentences.

The BLEU scores calculated on the flores200 dataset averaged 66 points across the models.

argostranslate-ru_en_and_en_ru

2 Likes

For comparison, I also asked ChatGPT to evaluate the translations:

Compare these two translations and rate them on a scale from 0 to 10. Also describe in detail the reasons for making such ratings.

  1. (current model)
    Many of Shakespeare’s plays were published in his life, different quality and accuracy. In 1623, two friends of Shakespeare, John Heming and Henry Condelle, published the First Folio, a meeting of all but two, the Shakespeare Pies, currently on the canon. Tom was prejudged by Ben Johnson’s poem, in which Johnson saw Shakespeare in the quote now known as “not a century but at all times”. Later, Shakespeare was given several more plays (or fragments) with varying degrees of evidence by various researchers.

Shakespeare’s life has been receiving commendable feedback on his work, but he’s really popular only in the nineteenth century. In particular, the representatives of romanticism and the Viktorians were so inclined in front of Shakespeare, that Bernard Shaw called it "bardolatry”, which in translation from English means “bardoprocity”. Shakespeare ’ s studies remain popular and are now under constant review and rethinking in accordance with political and cultural conditions.

  1. (new model)
    Many of Shakespeare’s plays were published during his lifetime, of varying quality and accuracy. In 1623, two friends of Shakespeare, John Heming and Henry Condell, published the First Folio, a collection of all but two Shakespeare plays currently included in the canon. Tom was preceded by Ben Johnson’s poem, in which Johnson visionaryly hailed Shakespeare in the now-famous quote as “not a century, but for all time.” Later, various researchers attributed several more plays (or fragments) to Shakespeare with varying degrees of evidence.

Already during his lifetime, Shakespeare received laudable reviews about his works, but it really became popular only in the XIX century. In particular, representatives of Romanticism and Victorians so admired Shakespeare that Bernard Shaw called it “bardolatry”, which in English means “bardolatry”. Shakespeare’s works remain popular today, and they are constantly studied and reinterpreted in accordance with political and cultural conditions.

Here is one of its answers, for the RU_EN model:

Comparison of Translations

Rating:
I would rate the new model’s translation an 8 out of 10 and the current model’s translation a 5 out of 10.

Reasons for Ratings:

Clarity and Cohesiveness:
The new model’s translation is clearer and more cohesive, earning it a higher rating. It maintains a consistent flow of information and utilizes language that is more coherent and easier to follow. The sentences are well-structured, and the information is presented in a logical order. In contrast, the current model’s translation suffers from issues related to sentence structure, coherence, and flow, which result in a lower rating.

Accuracy and Language Use:
The new model’s translation demonstrates improved accuracy and language use, which contributes to its higher rating. It accurately conveys the intended meaning and uses language that is more appropriate and precise. On the other hand, the current model’s translation contains several instances of awkward language use and inaccuracies, resulting in a less favorable rating in this regard.

Consistency and Terminology:
The new model’s translation maintains better consistency in terminology and expression, earning it a higher rating. It uses consistent terminology and expressions throughout the text, contributing to a more cohesive and polished translation. In contrast, the current model’s translation lacks consistency in terminology and expression, leading to a lower rating in this aspect.

Overall, the new model’s translation demonstrates improvements in clarity, accuracy, language use, consistency, and terminology, which collectively contribute to its higher rating. The current model’s translation, while containing the core information, is marred by issues related to coherence, accuracy, language use, consistency, and terminology, resulting in a lower rating.

1 Like

Nice, thanks for sharing! This is exciting, I will have to give the model a try.

What was your approach for cleaning the OPUS data? Would you be available to share the scripts that cleaned the data also? Perhaps they could be merged into Locomotive.

I actually used a Python script for cleaning that I wrote myself, and given my knowledge of Python, I'm a little embarrassed to show it.

Regarding the cleaning algorithm:

  1. I compare the source.txt and target.txt corpus files line by line.

  2. If a sentence has unbalanced brackets or quotation marks, I delete the extra ones.

  3. If a sentence contains the special character | reserved by opennmt-py, I delete it too; otherwise it may cause errors during training (this happened on the CCMatrix corpus).

  4. Filter the sentences so that they are at least 20 and at most 500 characters long (customizable).

  5. Check the ratio of source and target sentence lengths (from 0.6 to 1.5, customizable).

  6. Filter out sentences if more than 40% of their characters are non-alphanumeric.

  7. Filter out sentences if more than 40% of their characters are digits.

  8. I compare the digits in the source and target sentences, and if they don't match, I filter the pair out too.

  9. I check whether the sentences contain words or symbols from a STOP dictionary, and if they do, I filter them out as well. This was useful for removing sentences with the lists of countries that voted in MultiUN, or sound annotations in subtitles. For example:
    filter_list = ["http:", "http :", "&amp", "( Laughter )", "( Applause )", " Committee by :", "Against : ", "(Laughter)", "(Applause)"]

  10. I also look for and remove duplicates, but only within a window of about 1,000 lines; otherwise it takes too long and uses too much memory.

  11. It is also possible to check that the sentence language matches, but this is a rather resource-intensive, time-consuming operation on huge corpora.
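A minimal sketch of a few of the filters above (length, length ratio, digit matching, stoplist); the function names, thresholds, and stoplist entries are illustrative, not the actual script:

```python
import re

# Illustrative subset of a STOP dictionary, as described in step 9.
FILTER_LIST = ["http:", "http :", "&amp", "(Laughter)", "(Applause)"]

def length_ok(s: str, min_len: int = 20, max_len: int = 500) -> bool:
    # Step 4: keep sentences of reasonable length (customizable).
    return min_len <= len(s) <= max_len

def ratio_ok(src: str, tgt: str, lo: float = 0.6, hi: float = 1.5) -> bool:
    # Step 5: source/target length ratio within bounds.
    return lo <= len(src) / max(len(tgt), 1) <= hi

def digits_match(src: str, tgt: str) -> bool:
    # Step 8: pairs whose digits disagree are likely misaligned.
    return sorted(re.findall(r"\d", src)) == sorted(re.findall(r"\d", tgt))

def passes_stoplist(s: str) -> bool:
    # Step 9: drop sentences containing stoplisted tokens.
    return not any(token in s for token in FILTER_LIST)

def keep_pair(src: str, tgt: str) -> bool:
    return (length_ok(src) and length_ok(tgt)
            and ratio_ok(src, tgt)
            and digits_match(src, tgt)
            and passes_stoplist(src) and passes_stoplist(tgt))
```

Applied line by line to source.txt and target.txt, a pair is written out only when `keep_pair` returns True.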

Filtering a corpus with 150M sentences took about 40 minutes on a Ryzen 7950X (single-threaded).

I would also like to note that for some languages the algorithm in this form will not work, for example for languages written with logographic scripts.

As for Locomotive, I like this training program very much. However, I ran into a few problems, including multi-GPU training, the checkpoint-averaging algorithm, and resuming training (files are not sorted by name correctly, so, for example, step 9000 ends up sorted after step 10000). I haven't gotten around to creating issues on GitHub yet.

3 Likes

Don’t be; no one is going to care whether it’s pretty or not :slight_smile: plus it could be a starting point for people to help improve it.

Yep, the config in train.py needs to be tweaked to allow multi-GPU. But should be relatively easy to add support for it.
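For reference, `world_size` and `gpu_ranks` are the OpenNMT-py options that control multi-GPU training; a hedged sketch of what the tweak might look like (the helper function and surrounding structure are hypothetical; how Locomotive actually builds its config may differ):

```python
# Hypothetical sketch: extending a generated OpenNMT-py config for multi-GPU.
# `world_size` and `gpu_ranks` are real OpenNMT-py option names; everything
# else here is illustrative.

def enable_multi_gpu(config: dict, num_gpus: int) -> dict:
    config = dict(config)  # don't mutate the caller's config
    config["world_size"] = num_gpus             # total number of processes/GPUs
    config["gpu_ranks"] = list(range(num_gpus)) # e.g. [0, 1] for two GPUs
    return config

cfg = enable_multi_gpu({"save_model": "run/model"}, 2)
print(cfg["gpu_ranks"])  # [0, 1]
```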

I’ve briefly tested the models, and they seem to perform quite well on a small sample of translations! They retain more meaning compared to the current models, especially when performing backtranslation (en → ru → en).

2 Likes

All right, here’s an archive:

.env - the file where you set the basic script settings (directory to process, languages, etc.)
filter_tool.py - the filtering script itself
BT_libre_transl.py - script for creating back-translations
split_to_sents.py - script for splitting text into sentences for back-translation (one sentence per line).

filter_and_BT.zip

3 Likes

Awesome work thanks! I like the strategy of using ChatGPT to rate the quality of different translations.

Be careful about removing all of the short sentences under 20 characters. It may lead to the model performing poorly when users try to translate an individual word, because it wasn't trained on any examples of individual words being translated.

1 Like

Yes, I agree! I thought about this problem, and after looking at the text corpora, especially ones like CCMatrix, I decided that 20 characters was a reasonable compromise for filtering, since such short sentences are most often chapter numbers and the like.

However, for corpora such as OpenSubtitles or Tatoeba, I set the minimum length parameter to 0 characters, taking the specifics of these corpora into account.
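One way to express this per-corpus compromise (the mapping and defaults below are illustrative, not the actual settings file):

```python
# Illustrative per-corpus minimum-length settings, as described above.
# Corpus names match the OPUS datasets; the structure itself is hypothetical.
MIN_LEN = {
    "CCMatrix": 20,      # short lines are mostly chapter numbers and noise
    "OpenSubtitles": 0,  # keep short conversational lines
    "Tatoeba": 0,        # keep single words and short phrases
}

def min_len_for(corpus: str, default: int = 20) -> int:
    """Minimum sentence length to apply when filtering a given corpus."""
    return MIN_LEN.get(corpus, default)
```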

I also have a question about the size of models worth posting. I tried experimenting with training larger models (hidden 1024, ff 4096), which yields 210M parameters, and here is what I noticed:

  • High quality is achieved in fewer steps.
  • With almost the same BLEU score, the larger model is subjectively far superior to the smaller model in text fluency and quality.
  • I’ve tried giving translations of the large model to people and often they don’t recognize that they have translated text in front of them.
  • However, the translation speed of the larger model degrades roughly in proportion to the size increase (about 3.5x slower).
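For reference, a hedged sketch of the corresponding transformer settings (`hidden_size` and `transformer_ff` follow OpenNMT-py's option names; the smaller-model baselines in the comments are assumptions, not confirmed values):

```python
# Hypothetical config fragment for the larger transformer described above.
# hidden_size / transformer_ff are OpenNMT-py option names; head/layer
# counts and the "smaller model" baselines are illustrative assumptions.
big_model = {
    "hidden_size": 1024,     # model dimension (vs. e.g. 512 in a smaller model)
    "transformer_ff": 4096,  # feed-forward inner dimension (vs. e.g. 2048)
    "heads": 8,
    "layers": 6,
}
```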
1 Like

By the way, what if we add a separate dictionary corpus?
I happen to have one with 230k words; I should try it.

If you have dictionary data I would definitely recommend including it.

I’ve used this script before to collect Wiktionary data:

1 Like

Thanks for the script, it will be very useful!
I will try to add more data plus a corpus of dictionaries/phrases, refine the current models, and see how that affects the translation of individual words and of text in general. Very interesting!

1 Like

I’m fine uploading slightly larger models to the package index. 210M parameters is fine. I’m guessing this model would still be under 200MB compressed.

That’s a great result! We should experiment with using hidden 1024 and feed-forward 4096 with other languages. They might be a better default config than what we currently have.

Yes, I can confirm the finished package takes about 200 MB.
In that case I'll change the dataset (to improve translation of individual words) and train the large models right away.
I can then post them and the training configuration for testing by January 15.

2 Likes

I just finished writing this tool, GitHub - LibreTranslate/RemoveDup: Remove duplicates from parallel corpora, which can address the time/memory consumption issue when removing duplicates.
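Hashing each pair makes an exact full-corpus dedup feasible without the 1,000-line window mentioned earlier; a minimal sketch of the idea (not RemoveDup's actual implementation):

```python
import hashlib

def dedup_pairs(pairs):
    """Yield (src, tgt) pairs, skipping exact duplicates across the whole corpus.

    Storing 16-byte digests instead of full sentence pairs keeps memory
    bounded even on corpora with hundreds of millions of lines.
    """
    seen = set()
    for src, tgt in pairs:
        digest = hashlib.md5((src + "\t" + tgt).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield src, tgt

pairs = [("Hello.", "Привет."), ("Hi.", "Привет."), ("Hello.", "Привет.")]
print(list(dedup_pairs(pairs)))  # the exact duplicate pair is dropped
```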

2 Likes

I am posting a large EN_RU model with 210M parameters.
It corrects the translation of some words and expressions (after adding the dictionary corpus).
The translation has also become more accurate, and the number of small but very damaging translation errors has decreased, especially on news and scientific articles.

BLEU score: 67.91297.

The RU_EN model will be posted by January 15, as I promised earlier; its training should be finished by then.

translate-en_ru-2_2.argosmodel

2 Likes

For some reason all the previous files were deleted from the hosting.
If you can tell me the best way to upload files to the forum, I would be very grateful!

I personally like Dropbox or Google Drive to upload large files.

2 Likes

I’ve also just added the ability to perform filtering and transformations directly from Locomotive :boom: based on the ideas from this thread.

This should make it easier to clean up data sources.

2 Likes

Great job! This will make filtering the corpora so much easier, thanks!

I would also like to warn you to be careful with nonalphanum_count_mismatch and uppercase_count_mismatch filters.

For example (the translation is correct, but it will trip both filters):

source: It’s fun, I think. (2 uppercase, 3 non-alphanum)

target: Это весело, я так думаю. (1 uppercase, 2 non-alphanum)

I had to retrain the models twice before I realized that these filter conditions should be excluded; otherwise the translation came out unnatural.

2 Likes

I’m done training the RU-EN model.
It took 3 attempts and about 22 days, accounting for the nuances discovered during training in both the dataset and the model's hyperparameters.
The results are in the table below:

EN_to_RU models:

| Model | BLEU | COMET-22 | Model size | PPL | Explanations and Notes |
|---|---|---|---|---|---|
| 1.7 | 28.55 | 0.8480 | 62M | - | Current model in index. |
| 2.1 | 29.73 | 0.8608 | 62M | 12 | Dataset filtering and large effective batch size were applied. Doesn't translate individual words. |
| 2.2 | 32.97 | 0.8835 | 209M | 9.6 | Big model. A dictionary has been added; it can translate individual words. |
| GoogleTranslate | 35.52 | 0.9060 | - | - | |

RU_to_EN models:

| Model | BLEU | COMET-22 | Model size | PPL | Explanations and Notes |
|---|---|---|---|---|---|
| 1.0 | 27.75 | 0.8038 | 62M | - | Current model in index. |
| 1.2 | 35.15 | 0.8545 | 62M | 11.82 | Dataset filtering and large effective batch size were applied. Doesn't translate individual words. |
| 1.3a | 36.63 | 0.8598 | 209M | 10.69 | Big model. A dictionary has been added; it can translate individual words. |
| 1.3 | 38.31 | 0.8645 | 159M | 10.51 | Deep model. A dictionary has been added; it can translate individual words. |
| GoogleTranslate | 41.56 | 0.8752 | - | - | |

All tests were performed on FLORES200 validation data.
According to the OPUS-MT Dashboard, both of my models are ahead of both the OPUS-MT and facebook/wmt19 models on FLORES200 (OPUS-MT rus-eng and OPUS-MT eng-rus).

However, Google Translate is still some way ahead; so far I have tried to keep the size of the models reasonable while at least approaching its quality.

I propose to include these models in the index (en-ru v2.2 and ru-en v1.3):

translate-en_ru-2_2.argosmodel

translate-ru_en-1_3.argosmodel

3 Likes