New argos model en_ru and ru_en

lynxpda · December 26, 2023, 2:12pm

Hello,

This is a new model for en-ru and ru-en.

I trained the model with LibreTranslate/Locomotive on 148M filtered sentence pairs, with an effective batch size of 200k.

The model was trained on cleaned corpora from OPUS (https://opus.nlpl.eu/):
CCMatrix.en-ru, MultiUN.en-ru, News-Commentary.en-ru, OpenSubtitles.en-ru, QED.en-ru, Tatoeba.en-ru, TED2013.en-ru and TED2020.en-ru, plus our own corpus of back-translated sentences.

BLEU scores on the flores200 dataset averaged 66 points across the two models.

argostranslate-ru_en_and_en_ru

lynxpda · December 26, 2023, 3:16pm

For comparison, I also asked ChatGPT to evaluate the translations:

Compare these two translations and rate them on a scale from 0 to 10. Also describe in detail the reasons for making such ratings.

(current model)
Many of Shakespeare’s plays were published in his life, different quality and accuracy. In 1623, two friends of Shakespeare, John Heming and Henry Condelle, published the First Folio, a meeting of all but two, the Shakespeare Pies, currently on the canon. Tom was prejudged by Ben Johnson’s poem, in which Johnson saw Shakespeare in the quote now known as “not a century but at all times”. Later, Shakespeare was given several more plays (or fragments) with varying degrees of evidence by various researchers.

Shakespeare’s life has been receiving commendable feedback on his work, but he’s really popular only in the nineteenth century. In particular, the representatives of romanticism and the Viktorians were so inclined in front of Shakespeare, that Bernard Shaw called it "bardolatry”, which in translation from English means “bardoprocity”. Shakespeare ’ s studies remain popular and are now under constant review and rethinking in accordance with political and cultural conditions.

(new model)
Many of Shakespeare’s plays were published during his lifetime, of varying quality and accuracy. In 1623, two friends of Shakespeare, John Heming and Henry Condell, published the First Folio, a collection of all but two Shakespeare plays currently included in the canon. Tom was preceded by Ben Johnson’s poem, in which Johnson visionaryly hailed Shakespeare in the now-famous quote as “not a century, but for all time.” Later, various researchers attributed several more plays (or fragments) to Shakespeare with varying degrees of evidence.

Already during his lifetime, Shakespeare received laudable reviews about his works, but it really became popular only in the XIX century. In particular, representatives of Romanticism and Victorians so admired Shakespeare that Bernard Shaw called it “bardolatry”, which in English means “bardolatry”. Shakespeare’s works remain popular today, and they are constantly studied and reinterpreted in accordance with political and cultural conditions.

Here is one of its answers for the RU_EN model:

Comparison of Translations

Rating:
I would rate the new model’s translation an 8 out of 10 and the current model’s translation a 5 out of 10.

Reasons for Ratings:

Clarity and Cohesiveness:
The new model’s translation is clearer and more cohesive, earning it a higher rating. It maintains a consistent flow of information and utilizes language that is more coherent and easier to follow. The sentences are well-structured, and the information is presented in a logical order. In contrast, the current model’s translation suffers from issues related to sentence structure, coherence, and flow, which result in a lower rating.

Accuracy and Language Use:
The new model’s translation demonstrates improved accuracy and language use, which contributes to its higher rating. It accurately conveys the intended meaning and uses language that is more appropriate and precise. On the other hand, the current model’s translation contains several instances of awkward language use and inaccuracies, resulting in a less favorable rating in this regard.

Consistency and Terminology:
The new model’s translation maintains better consistency in terminology and expression, earning it a higher rating. It uses consistent terminology and expressions throughout the text, contributing to a more cohesive and polished translation. In contrast, the current model’s translation lacks consistency in terminology and expression, leading to a lower rating in this aspect.

Overall, the new model’s translation demonstrates improvements in clarity, accuracy, language use, consistency, and terminology, which collectively contribute to its higher rating. The current model’s translation, while containing the core information, is marred by issues related to coherence, accuracy, language use, consistency, and terminology, resulting in a lower rating.

pierotofy · December 26, 2023, 3:34pm

Nice, thanks for sharing! This is exciting, I will have to give the model a try.

What was your approach for cleaning the OPUS data? Would you be available to share the scripts that cleaned the data also? Perhaps they could be merged into Locomotive.

lynxpda · December 26, 2023, 4:29pm

I actually used a Python script that I wrote myself for cleaning, and given my knowledge of Python, I’m a little embarrassed to show it.

Regarding the cleaning algorithm:

I compare the source.txt and target.txt corpus files line by line.
If a sentence has unbalanced brackets or quotation marks, I delete the extra ones.
If a sentence contains the special character |, which is reserved in OpenNMT-py, I delete it too; otherwise it may cause errors during training (this happened on the CCMatrix corpus).
I filter the sentences so that they are at least 20 and at most 500 characters long (customizable).
I check the ratio of source to target sentence length (from 0.6 to 1.5, customizable).
I filter out sentences that consist of more than 40% symbols (non-alphanumeric characters).
I filter out sentences that consist of more than 40% digits.
I compare the sums of the digits in the source and target sentences, and filter out the pair if they don’t match.
I check whether the sentences contain words or symbols from a stop list, and filter those out too. This was useful for removing sentences that list the countries that voted in MultiUN, or sound annotations in subtitles. For example:
filter_list = ["http:", "http :", "&amp", "( Laughter )", "( Applause )", " Committee by :", "Against : ", "(Laughter)", "(Applause)"]
I also look for and remove duplicates, but only within a window of about 1000 lines; otherwise it takes too much time and memory.
It is also possible to check that each sentence is in the expected language, but this is a rather resource-intensive, time-consuming operation on huge corpora.
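
The per-pair rules in the list above could be sketched roughly as follows. This is a minimal illustration, not the actual script: the function and constant names are hypothetical, "symbols" is assumed to mean characters that are neither alphanumeric nor whitespace, and the bracket/quote repair and language check are left out.

```python
# Thresholds are the ones quoted in the post; all names here are hypothetical.
MIN_CHARS, MAX_CHARS = 20, 500
MIN_RATIO, MAX_RATIO = 0.6, 1.5
MAX_SYMBOL_SHARE = 0.4
MAX_DIGIT_SHARE = 0.4
FILTER_LIST = ["http:", "http :", "&amp", "( Laughter )", "( Applause )",
               " Committee by :", "Against : ", "(Laughter)", "(Applause)"]


def clean(sentence: str) -> str:
    """Remove the '|' character, which the post says is reserved in OpenNMT-py."""
    return sentence.replace("|", "").strip()


def share(sentence: str, predicate) -> float:
    """Fraction of characters in the sentence that satisfy the predicate."""
    return sum(1 for ch in sentence if predicate(ch)) / max(len(sentence), 1)


def is_symbol(ch: str) -> bool:
    """Assumed meaning of 'symbol': neither alphanumeric nor whitespace."""
    return not ch.isalnum() and not ch.isspace()


def digit_sum(sentence: str) -> int:
    """Sum of all decimal digits that appear in the sentence."""
    return sum(int(ch) for ch in sentence if ch.isdecimal())


def keep_pair(src: str, tgt: str) -> bool:
    """Return True if the sentence pair passes every filter in this sketch."""
    for s in (src, tgt):
        if not MIN_CHARS <= len(s) <= MAX_CHARS:
            return False
        if share(s, is_symbol) > MAX_SYMBOL_SHARE:
            return False
        if share(s, str.isdecimal) > MAX_DIGIT_SHARE:
            return False
        if any(stop in s for stop in FILTER_LIST):
            return False
    # Both lengths are at least MIN_CHARS here, so the division is safe.
    if not MIN_RATIO <= len(src) / len(tgt) <= MAX_RATIO:
        return False
    return digit_sum(src) == digit_sum(tgt)
```

In use, each line pair would be passed through `clean` first and written out only when `keep_pair` returns True.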

Filtering a corpus of 150M sentences took about 40 minutes on a Ryzen 7950X (single thread).
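
The windowed duplicate check from the list above can be sketched separately. This is only an illustration under assumptions: the function name is hypothetical, and the window here covers the last 1000 pairs that were kept rather than the last 1000 lines read, which may differ slightly from the real script. The point is that memory stays bounded by the window size instead of growing with the corpus.

```python
from collections import deque


def dedup_windowed(pairs, window=1000):
    """Yield pairs, skipping any pair already seen among the last `window` kept pairs."""
    recent = deque()   # kept pairs in arrival order
    seen = set()       # the same pairs, for fast membership tests
    for pair in pairs:
        if pair in seen:
            continue
        yield pair
        recent.append(pair)
        seen.add(pair)
        if len(recent) > window:
            # Forget the oldest pair once the window is full.
            seen.discard(recent.popleft())
```

A duplicate that reappears after more than `window` kept pairs is not caught, which is the trade-off described in the list.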

I would also like to note that the algorithm in this form will not work for some languages, for example those written with logographic characters, such as Chinese or Japanese.

As for Locomotive, I like this training program very much. However, I found some problems with it, including training on multiple GPUs, the checkpoint averaging algorithm, and resuming training (checkpoint files are not sorted by name correctly, so that, for example, step 9000 sorts after step 10000). I haven’t gotten around to creating issues on GitHub yet.
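
The ordering problem is easy to reproduce with plain string sorting. The sketch below uses made-up checkpoint filenames and a hypothetical helper; it is not Locomotive's code, only an illustration of why sorting by name puts step 9000 after step 10000 and how sorting by the numeric step avoids that.

```python
import re


def checkpoint_step(name: str) -> int:
    """Return the last run of digits in a checkpoint filename as an integer."""
    numbers = re.findall(r"\d+", name)
    if not numbers:
        raise ValueError(f"no step number in {name!r}")
    return int(numbers[-1])


# Illustrative filenames only.
names = ["model_step_10000.pt", "model_step_9000.pt", "model_step_11000.pt"]

by_name = sorted(names)                       # lexicographic: '1' sorts before '9'
by_step = sorted(names, key=checkpoint_step)  # numeric: 9000 < 10000 < 11000
```

With name-based sorting the step 9000 file ends up last, so it would be treated as the newest checkpoint.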

pierotofy · December 26, 2023, 4:35pm

Don’t be, no one is going to care whether it’s pretty or not. Plus, it could be a starting point for people to help improve it.

Yep, the config in train.py needs to be tweaked to allow multi-GPU, but it should be relatively easy to add support for it.

I’ve briefly tested the models, and they seem to perform quite well on a small sample of translations! They retain more meaning than the current models, especially when performing backtranslation (en → ru → en).

lynxpda · December 26, 2023, 5:06pm

All right. Here’s the archive:

.env - the file where you set the basic script settings (directory for processing, languages, etc.)
filter_tool.py - the filtering script itself
BT_libre_transl.py - script for creating back translation
split_to_sents.py - script for splitting text into sentences for back translation (one sentence per line).

filter_and_BT.zip

argosopentech · December 26, 2023, 11:52pm

Awesome work, thanks! I like the strategy of using ChatGPT to rate the quality of different translations.

Be careful about removing all of the short sentences under 20 characters. It may lead to the model performing poorly when users try to translate an individual word, because it wasn’t trained on any examples of an individual word being translated.

lynxpda · December 27, 2023, 8:07am

Yes, I agree! I’ve thought about this problem, and after looking at text corpora, especially ones like CCMatrix, I decided that 20 characters would be a reasonable compromise for filtering, since most of the time such short sentences are chapter numbers and the like.

However, for filtering such corpora as OpenSubtitles or Tatoeba, I set the minimum length parameter to 0 characters, taking into account the specifics of these corpora.

I also have a question about the size of models that are worth posting. I experimented with training larger models (hidden size 1024, feed-forward 4096), which come to about 210M parameters, and this is what I noticed:

High quality is achieved in fewer steps.
With almost the same BLEU score, the larger model is subjectively far superior to the smaller model in text fluency and quality.
I’ve tried giving translations of the large model to people and often they don’t recognize that they have translated text in front of them.
However, the larger model translates more slowly, roughly in proportion to the increase in size (about 3.5x).
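
As a rough sanity check on the parameter count, a back-of-the-envelope estimate for a standard encoder-decoder Transformer can be written down. Everything except the hidden size of 1024 and the feed-forward size of 4096 is an assumption here and is not stated in this thread: 6 encoder and 6 decoder layers, a 32,000-entry vocabulary with one shared embedding table, and a 512/2048 configuration for the smaller model. Biases and layer normalization weights are ignored.

```python
def transformer_params(d_model, d_ff, enc_layers=6, dec_layers=6, vocab=32000):
    """Approximate weight count of an encoder-decoder Transformer (assumed layout)."""
    attention = 4 * d_model * d_model      # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff      # two linear layers
    encoder = enc_layers * (attention + feed_forward)
    decoder = dec_layers * (2 * attention + feed_forward)  # self- plus cross-attention
    embeddings = vocab * d_model           # assumed single shared embedding table
    return encoder + decoder + embeddings


large = transformer_params(1024, 4096)   # 208,928,768
small = transformer_params(512, 2048)    # 60,424,192 (assumed base configuration)
```

Under these assumptions the larger configuration lands close to the 210M figure quoted above, and the ratio between the two estimates is about 3.5.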

lynxpda · December 27, 2023, 12:19pm

By the way, what if we add a dictionary as a separate corpus?
I happen to have one with 230k words; I should try it.

argosopentech · December 27, 2023, 2:00pm

If you have dictionary data I would definitely recommend including it.

I’ve used this script before to collect Wiktionary data:

https://github.com/argosopentech/argos-train/blob/69e4a25230d4a300e7465709744688cf78ee96c0/bin/generate-wiktionary-data

#!/usr/bin/env python3

# https://github.com/tatuylonen/wiktextract

import os
import json
import argparse
from pathlib import Path

# Configure
sl = 'en'
tl = 'es'

parser = argparse.ArgumentParser()
parser.add_argument("wikidata", help="path to Wiktionary JSON file")
parser.add_argument("wikidata2", help="path to second Wiktionary JSON file (optional)")
args = parser.parse_args()

# Read JSON
wikidata = []

This file has been truncated.

lynxpda · December 27, 2023, 2:27pm

Thanks for the script, it will be very useful!
I will try adding more data and a corpus of dictionaries/phrases, refine the current models, and see how that affects the translation of individual words and of text in general. Very interesting!

argosopentech · December 27, 2023, 3:03pm

I’m fine uploading slightly larger models to the package index. 210M parameters is fine. I’m guessing this model would still be under 200MB compressed.

That’s a great result! We should experiment with using Hidden1024 and FeedForward4096 with other languages. They might be a better default config than what we currently have.

lynxpda · December 27, 2023, 4:12pm

Yes, I can confirm: the finished package takes about 200 MB.
In that case I can change the dataset (to improve translation of individual words) and train the large models right away.
I can then post them, together with the training configuration, for testing by January 15.

pierotofy · December 30, 2023, 5:42am

I just finished writing this tool, LibreTranslate/RemoveDup (on GitHub), which removes duplicates from parallel corpora and can address the time/memory consumption issue for removing duplicates.

lynxpda · December 31, 2023, 9:54am

I am posting a large EN_RU model with 210M parameters.
The translation of some words and expressions has been corrected (after adding the dictionary corpus).
The translation has also become more accurate overall, and there are fewer of the small but damaging translation errors, especially in news and scientific articles.

BLEU score: 67.91297.

The RU_EN model will be posted by January 15, as I promised earlier; its training should be finished by then.

translate-en_ru-2_2.argosmodel

lynxpda · December 31, 2023, 10:02am

For some reason all the previous files were deleted from the hosting.
If you can tell me the best way to upload files to the forum, I will be very grateful!

pierotofy · December 31, 2023, 11:20am

I personally like Dropbox or Google Drive for uploading large files.

pierotofy · December 31, 2023, 12:20pm

I’ve also just added the ability to perform filtering and transformations directly from Locomotive based on the ideas from this thread.

This should make it easier to clean up data sources.

lynxpda · December 31, 2023, 1:46pm

Great job! This will make filtering the corpora so much easier, thanks!

I would also like to warn you to be careful with nonalphanum_count_mismatch and uppercase_count_mismatch filters.

For example (the translation is correct, but will fall under both filters):

source: It’s fun, I think. (2 uppercase, 3 non-alphanumeric)

target: Это весело, я так думаю. (1 uppercase, 2 non-alphanumeric)

I had to retrain the models twice before I realized that these filter conditions should be excluded, otherwise the translation was unnatural.
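
The counts in the example can be reproduced with a few lines of Python. The helper below is a hypothetical illustration, not Locomotive’s implementation of these filters. It treats any character that is neither alphanumeric nor whitespace as non-alphanumeric, an assumption chosen because it matches the counts given above.

```python
def count_stats(sentence: str) -> tuple[int, int]:
    """Return (uppercase letters, non-alphanumeric non-space characters)."""
    uppercase = sum(1 for ch in sentence if ch.isupper())
    nonalphanum = sum(1 for ch in sentence
                      if not ch.isalnum() and not ch.isspace())
    return uppercase, nonalphanum


source = "It’s fun, I think."
target = "Это весело, я так думаю."

src_stats = count_stats(source)   # (2, 3): "I", "I" and the apostrophe, comma, period
tgt_stats = count_stats(target)   # (1, 2): "Э" and the comma, period

# A strict equality check on either count would reject this correct pair.
rejected = src_stats != tgt_stats
```

The mismatch comes from ordinary differences between the languages (the English pronoun "I", the contraction), not from a bad translation.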

lynxpda · January 24, 2024, 11:26am

I’m done with training the RU-EN model.
It took 3 attempts and about 22 days, taking into account the nuances discovered during the training process with the dataset and hyperparameters of the model itself.
The results are in the table below:

EN_to_RU models:

Model	BLEU	COMET-22	Model size	PPL	Explanations and Notes
1.7	28.55	0.8480	62M	-	Current model in index.
2.1	29.73	0.8608	62M	12	Dataset filtering and large effective batch size were applied. Doesn’t translate individual words.
2.2	32.97	0.8835	209M	9.6	Big model. A dictionary has been added that can translate individual words.
GoogleTranslate	35.52	0.9060	-	-

RU_to_EN models:

Model	BLEU	COMET-22	Model size	PPL	Explanations and Notes
1.0	27.75	0.8038	62M	-	Current model in index.
1.2	35.15	0.8545	62M	11.82	Dataset filtering and large effective batch size were applied. Doesn’t translate individual words.
1.3a	36.63	0.8598	209M	10.69	Big model. A dictionary has been added that can translate individual words.
1.3	38.31	0.8645	159M	10.51	DEEP model. A dictionary has been added that can translate individual words.
GoogleTranslate	41.56	0.8752	-	-

All tests were performed on FLORES200 validation data.
Both of my models, according to the OPUS-MT Dashboard, are ahead of both the OPUS-MT model and the facebook/wmt19 model on FLORES200: (OPUS-MT rus-eng and OPUS-MT eng-rus).

However, GoogleTranslate is still some way ahead. So far I have tried to keep the model sizes reasonable while at least approaching its quality.
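
To put rough numbers on that gap, the BLEU figures from the tables above can be compared directly. The snippet below only does the subtraction; the dictionary layout and function name are hypothetical, and the values are copied from the tables (en-ru v2.2 and ru-en v1.3 as the proposed models).

```python
# BLEU values copied from the tables above.
bleu = {
    "en-ru": {"index": 28.55, "proposed": 32.97, "google": 35.52},
    "ru-en": {"index": 27.75, "proposed": 38.31, "google": 41.56},
}


def bleu_gaps(direction: str) -> tuple[float, float]:
    """Return (gain over the indexed model, remaining gap to GoogleTranslate)."""
    s = bleu[direction]
    gain = round(s["proposed"] - s["index"], 2)
    gap = round(s["google"] - s["proposed"], 2)
    return gain, gap


en_ru = bleu_gaps("en-ru")   # gain of 4.42 BLEU, 2.55 BLEU behind GoogleTranslate
ru_en = bleu_gaps("ru-en")   # gain of 10.56 BLEU, 3.25 BLEU behind GoogleTranslate
```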

I propose to include these models in the index (en-ru v2.2 and ru-en v1.3):

translate-en_ru-2_2.argosmodel

translate-ru_en-1_3.argosmodel