The problem of translation quality and its solution

In this topic I'd like to discuss a plan for solving the translation quality problem.
As you know, many of the standard 65M-parameter models have poor translation quality for a number of reasons.
I propose the following approach to the problem:

  1. Review the existing models and compare their performance with SOTA systems, or at least with Google Translate. This can be done automatically using COMET-22 and, optionally, SacreBLEU (see the sketch after this list).
  2. Make a list of languages whose models need to be updated (re-trained).
  3. Agree on the maximum allowable model size and on its architecture, so that we end up with a generic model that delivers the best quality relative to its size.
  4. Train the base models EN_XXX and XXX_EN.
  5. Create a script to transfer the training from base models to target models.
  6. Train the new models, using the base models for transfer learning to reduce training time and cost.
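
For item 1, here is a minimal sketch of what the automated comparison could look like, assuming the unbabel-comet and sacrebleu Python packages and three aligned plain-text files (the file names are placeholders):

```python
# Sketch: score a model's output against references with COMET-22 and SacreBLEU.
# Assumes `pip install unbabel-comet sacrebleu` and one segment per line per file.
from comet import download_model, load_from_checkpoint
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

sources = read_lines("test.src")      # source sentences
hypotheses = read_lines("test.hyp")   # model output to evaluate
references = read_lines("test.ref")   # human reference translations

# COMET-22 (reference-based)
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
comet_score = comet.predict(data, batch_size=32, gpus=1).system_score

# SacreBLEU (optional, corpus-level)
bleu = sacrebleu.corpus_bleu(hypotheses, [references])

print(f"COMET-22: {comet_score:.4f}  BLEU: {bleu.score:.1f}")
```
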
1 Like

I like the approach. One thing I think is really important is to keep the model size as small as possible. People should be able to run the software on commodity hardware at a decent speed.

2 Likes

Keeping the models as small as possible while improving quality, and finding a generic architecture able to do that so we can concentrate on the content afterwards, is exactly what I have been trying to do for the last few months.

As of now, I have 148M-parameter models that can be trained with the current Locomotive from a not-so-perfect dataset anyone can create using Locomotive (when selecting the data on OPUS, it helps to know both languages well…). They already surpass or come to par with many automatic translators widely used in the industry. Data selection does 60% of the job, the architecture 40%.

Having run two or three trainings simultaneously on more than two datasets for the last 3 months, I have become quite certain of this point. I cannot be sure that bigger models are not architecturally superior, but on my datasets the ones I trained up to 400M parameters brought, in the best-case scenario, only marginal improvement (less than one improved translation every 1000 characters) compared to the 148M ones. They also trained more slowly, some even ran inference more slowly, and the majority gave worse results.

A further improvement pioneered by lynxpda requires modifying both train.py and the CTranslate2 dependency (a PR is in the making by another contributor; a stable version is awaited), and brings us on par with the market leaders. I am following him closely to see where it leads me on the English-German language pair.

Once these improvements have been applied, the models I have built weigh in at 199M parameters and have a shallow decoder, so they're quite fast.

I've also begun characterizing which data can be used in the French-English pair to prove the point. Since the quality is already quite good, I am trying to see how much further it is possible to go.

3 Likes

Yeah, I'd have to agree. In my research I also came to the conclusion that the optimal model size is around 150M parameters, with a deep encoder and a shallow decoder.

In my opinion, this is the optimal configuration:

dec_layers: 18
decoder_type: transformer
enc_layers: 6
encoder_type: transformer
heads: 8
hidden_size: 512
max_relative_positions: -1 # RoPE
model_dtype: fp16
pos_ffn_activation_fn: gated-gelu
position_encoding: false
share_decoder_embeddings: true
share_embeddings: true
share_vocab: true
src_vocab_size: 32000
tgt_vocab_size: 32000
transformer_ff: 3072 # 4608 effective
word_vec_size: 512

As for testing, a little later I will write a script that should automate this process.

1 Like

The configuration looks good, but it's quite difficult to implement (see the post about the en-de model: to enable RoPE on Locomotive, you have to trust your luck a lot :slight_smile:).

People who do not have the knowledge to work around this may appreciate the alternative Shaw configuration with max_relative_positions = 20 or 32, which trains more slowly but should give comparable results.

As for transformer_ff, widening the feed-forward layer from 2048 to 4096 gave the biggest boost to translation accuracy compared with the default or higher values, but I skipped 3072, so I still have to check it.

The gated activation function "silu" may also do the job that "gated-gelu" does: @lynxpda got bad results with it on en-ru, while mine are better than gated-gelu on de-en. I do not know why; it may be related to the language pair (which is why I am now trying to train en-fr with these). The Noam Shazeer preprint we base ourselves on reports only a third-decimal difference between them, so I think both are worth trying.
Note that, for the time being, these gated functions only produce a valid model when the CT2 converter is tweaked and training uses RPE or RoPE.

1 Like

150M seems reasonable to me. Downloading all of the models is a major bottleneck for a lot of LibreTranslate users, but having high-quality translations is also very important.

The models also don't all need to be the same size; we should probably prioritize improving the most used language pairs.

2 Likes

For the rationale behind implementing rotary position encoding, read the following:
https://medium.com/@ngiengkianyew/understanding-rotary-positional-encoding-40635a4d078e
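
For readers who prefer code to prose, here is an illustrative NumPy sketch of the core idea (not the OpenNMT-py implementation; the exact pairing of dimensions differs between implementations):

```python
# Illustrative sketch of rotary position embeddings (RoPE). Each pair of feature
# dimensions of a query/key vector is rotated by an angle that grows with the token
# position, so the dot product q·k ends up depending only on the relative offset.
import numpy as np

def apply_rope(x, theta=10000.0):
    """x: (seq_len, dim) queries or keys, dim must be even."""
    seq_len, dim = x.shape
    inv_freq = 1.0 / theta ** (np.arange(0, dim, 2) / dim)   # one frequency per pair
    angles = np.outer(np.arange(seq_len), inv_freq)          # (seq_len, dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                          # split into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                       # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```
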

I'm trying to get a fix: the error I obtain with onmt 3.5 in the valid.BLEU calculation might come from empty tokens generated by the split(" ") call introduced earlier this year in onmt train, in place of the legacy split(), which dropped empty strings.

From my discussion on GitHub with @vince62s, who is one of the main contributors to the OpenNMT-py and CTranslate2 projects, there is indeed a bug in whitespace tokenization in onmt v3.5.

As for training with RoPE using onmt 3.4.3, I have the following observation: in the absence of a rotary_theta parameter, positional encoding is rendered unambiguously only on contiguous dimensions.

Hence, any token used in both the source and target language with different meanings or constructs may eventually make training diverge.

I had tested "rotary 3.4.3" on both German->English and English->French to rule out the dataset and the Stanza library as causes. Both trainings diverged around the end of the warmup period. All three languages have quite a few tokens in common with different meanings or placement rules.

On more distant language pairs, with close to no such tokens, one may be lucky and training may converge, but I guess it then amounts to relative position encoding.

Having raised the issue with @vince62s, I'll be coming up next Monday with guidelines for a full RoPE implementation in Locomotive. The correct versions are the latest ones (CT2 4.3.1 and ONMT 3.5.1); previous versions, as well as PR#1687, have incomplete specs for this position encoding. Torch 2.1.1+cu121 has to be installed manually prior to the upgrade (the CUDA runtime is not available through requirements.txt).

The tokenization issue in ONMT 3.5 is quite severe and can affect training, so I have to check it on at least 4 trainings first, but I have a working fix for train.py (introducing two transform arguments, valid_transform and train_transform, instead of one), tested on toy models with RoPE, RPE and default position encoding.

These trainings will also allow me to characterize what RoPE brings in practice when translating.

I'll also restore the default values for "src_seq_length" and "tgt_seq_length" (192), which should help with translating longer sentences.
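
In config terms, that simply means putting the defaults back:

src_seq_length: 192
tgt_seq_length: 192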

I think the best strategy is to update the models incrementally, language by language. I recommend starting with the most widely used languages (@pierotofy may have some data on this from LibreTranslate.com). A lot of the most popular language pairs, English-Spanish for example, are still on version 1.0; those are models I trained in 2021 using the OpenNMT-tf default config.

As people train new models, just post on this forum and I can update the index.

1 Like

I currently don't do any type of tracking on the translation API calls (as part of the site's maximum-privacy promise), but most traffic we get on the site is from the US, India, Germany, China, Spain, Brazil and Russia. Not sure if that's a good proxy for picking a language pair to improve.

2 Likes

As for language pairs to prioritize, that list of countries looks like a good guideline.

As for rotary position encoding, I still do not have a settled opinion, because I have only just finished debugging the feature. It turns out that, from onmt-py 3.4.2 onwards, FlashAttention2 is used when training models (except those with "shaw" position encoding). And flash_attn2 does not support legacy GPUs, only recent ones, but no one on the ONMT or CT2 projects seems to have considered that some backwards compatibility would be appreciated…

After several failed trials, and since only Shaw models were still functional after the upgrade, I dived into the specs and code (luckily, rotary is already implemented well enough in onmt 3.4.1 for what Locomotive does) and wrote a fix.

Using the following files, one can produce a "rope" model with the current Locomotive: just rename the extensions to .py and copy them into the respective python3x/Lib/site-packages/ctranslate2/converters/ and ctranslate2/specs/ directories, in place of the existing files.

opennmt_py.py - 320 rope alibi glu (pixeldrain)
transformer_spec.py - 320 rope alibi glu (pixeldrain)

Then apply the configuration features given by @lynxpda and it's a go.
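
If you would rather not hunt for the site-packages path by hand, a small hypothetical helper along these lines does the copy (the source paths are wherever you saved the two downloads; keep a backup of the originals):

```python
# Hypothetical convenience script: overwrite the installed CTranslate2 files with
# the patched versions linked above. Adjust the source paths to your downloads.
import os
import shutil

import ctranslate2

pkg_dir = os.path.dirname(ctranslate2.__file__)
shutil.copy("opennmt_py.py", os.path.join(pkg_dir, "converters", "opennmt_py.py"))
shutil.copy("transformer_spec.py", os.path.join(pkg_dir, "specs", "transformer_spec.py"))
print("Patched files copied into", pkg_dir)
```
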

Having tried the various possible configurations using RoPE, I cannot recommend it.
Most trainings with gated activation functions exhibited vanishing gradients, others aborted on early stopping, and only one went through. Trainings with the default relu activation went through, but the results are inferior to the vanilla DEEP model with basic position encoding.
It does not solve the syntactic issues at all, as a more advanced position encoding should. And lastly, training is not that fast: convergence starts quickly but plateaus afterwards.

Then again, it may well be related to my using second-generation GPUs, so if your GPU is third generation, feel free to upgrade ONMT to 3.5.1, PyTorch to 2.1 and CTranslate2 to 4.3.1, install flash-attn2, modify train.py (to fix a scoring bug in OpenNMT), pull the adapted CTranslate2 patch (for RoPE and gated activations), and prove me wrong. I'll invest in something more recent.
train.py (pixeldrain)
opennmt_py.py - 431mergedPR1687 (pixeldrain)
transformer_spec.py - rope (pixeldrain)
All 3 files (pixeldrain)

But for now, I will either stick to the 148M "DEEP" model, whose results were the most reliable:

enc_layers : 18
dec_layers : 6
transformer_ff : 4096
vocab_size : 32000

or the 161M gated-gelu model with Shaw 20 position encoding ("silu" is almost equivalent, but slightly inferior and has some early-stopping issues):

enc_layers : 18
dec_layers : 6
transformer_ff : 3072
vocab_size : 32000
pos_ffn_activation_fn : gated-gelu
position_encoding : false
max_relative_positions : 20

Both train slowly but steadily and have better evaluation metrics on my training server than the models using rotary position encoding.
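
As a sanity check on those figures, here is a rough parameter count (biases and layer norms ignored, shared embeddings counted once; the factor of 3 for gated activations accounts for the extra feed-forward projection):

```python
# Rough parameter count for the transformer configurations above, ignoring biases
# and layer norms. Shared src/tgt/output embeddings are counted once. Gated
# activations (gated-gelu, silu) add a third feed-forward projection.

def approx_params(enc_layers, dec_layers, d_model, d_ff, vocab, gated=False):
    emb = vocab * d_model                        # shared embedding matrix
    attn = 4 * d_model * d_model                 # Q, K, V and output projections
    ffn = (3 if gated else 2) * d_model * d_ff   # feed-forward block
    enc_layer = attn + ffn                       # self-attention + FFN
    dec_layer = 2 * attn + ffn                   # + cross-attention
    return emb + enc_layers * enc_layer + dec_layers * dec_layer

print(approx_params(18, 6, 512, 4096, 32000) / 1e6)               # ~148.5 (DEEP, relu)
print(approx_params(18, 6, 512, 3072, 32000, gated=True) / 1e6)   # ~161.1 (gated-gelu, Shaw 20)
```

This is also why the earlier configuration notes "4608 effective" next to transformer_ff: 3072 with a gated activation has the same parameter count as a plain feed-forward of width 4608.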

1 Like

After also training on English-French pairs, I have come to the conclusion that the training data really is the dominant factor in a model's quality.

I tried first on a mix of excerpts from CCMatrix, UNPC, Multi-UN, EuroPat, TEDs and DGT (EU), to which I added the Canadian Hansards and Cadlaws. Totalling 55M selected sentence pairs, I got slightly better results than the existing models, but nothing really exciting: the improvement was due to the hyperparameters.

Then I reduced the dataset to 25M sentence pairs, keeping only the best excerpts, and got a small jump of 0.4 COMET points. Nothing too noticeable in a manual evaluation.

I then added 20M backtranslated sentences from the Leipzig University Wortschatz corpora. Scores went back to the values I had with the 55M sentence pairs. So much for backtranslation.

Incidentally, I was asked to look for an LLM-backed translator and came across TowerInstruct. The team that developed it explicitly used the reference-free wmt22-cometkiwi-da model to filter their training data. They reach a 0.8824/0.884 COMET score after training with only 2M sentence pairs per language. For comparison, Google is at 0.8922/0.8992, LT 1.9 at 0.863/0.882, and I barely reach 0.873 towards French and 0.890 towards English.

So I devised a script to calculate COMET scores and sort sentence pairs by them (~10k a minute, a little more than 10 million a day). It is not a filter yet, since I first have to find out how the scores are distributed across the various corpora, but a good threshold for a filter might be between 0.85 and 0.89, depending on how selective one wants to be.

The goal is to get the best 25M sentence pairs COMET-wise and train models on them. For now the script is single-GPU only, but the code can be optimized for multiple GPUs.
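
For those who want to reproduce it, the gist of the script looks something like this (a sketch assuming the unbabel-comet package and a tab-separated source/target file; wmt22-cometkiwi-da is gated on Hugging Face, so you need to accept its license and log in first):

```python
# Sketch: score parallel sentence pairs with the reference-free wmt22-cometkiwi-da
# model and write them out sorted by score (highest first). Single GPU, as described.
from comet import download_model, load_from_checkpoint

pairs = []
with open("corpus.tsv", encoding="utf-8") as f:        # one "source<TAB>target" pair per line
    for line in f:
        src, tgt = line.rstrip("\n").split("\t", 1)
        pairs.append({"src": src, "mt": tgt})           # reference-free: no "ref" field

model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))
scores = model.predict(pairs, batch_size=64, gpus=1).scores

ranked = sorted(zip(scores, pairs), key=lambda t: t[0], reverse=True)
with open("corpus_sorted.tsv", "w", encoding="utf-8") as out:
    for score, pair in ranked:
        out.write(f"{score:.4f}\t{pair['src']}\t{pair['mt']}\n")
```
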

1 Like