The problem of translation quality and its solution

In this topic I'd like to discuss a plan for solving the translation quality problem.
As you know, many of the standard 65M-parameter models have poor translation quality for a number of reasons.
I propose the following approach to the problem:

  1. Review the existing models and compare their performance with SOTA systems, or at least with Google Translate. This can be done automatically using COMET-22 and, optionally, SacreBLEU (see the sketch after this list).
  2. Make a list of languages whose models need to be updated (re-trained).
  3. Agree on the maximum allowable model size and on its architecture, so that we end up with a generic model that delivers the best quality relative to its size.
  4. Train the base models EN_XXX and XXX_EN.
  5. Create a script to transfer the training from base models to target models.
  6. Train the new models, using the base models for transfer learning to reduce training time and cost.
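
For item 1, here is a minimal sketch of what the automated comparison could look like, assuming the unbabel-comet and sacrebleu Python packages and three aligned plain-text files (the file names are placeholders):

```python
# Sketch: score a model's output against references with COMET-22 and SacreBLEU.
# Assumes `pip install unbabel-comet sacrebleu` and one segment per line per file.
from comet import download_model, load_from_checkpoint
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

sources = read_lines("test.src")      # source sentences
hypotheses = read_lines("test.hyp")   # model output to evaluate
references = read_lines("test.ref")   # human reference translations

# COMET-22 (reference-based)
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
comet_score = comet.predict(data, batch_size=32, gpus=1).system_score

# SacreBLEU (optional, corpus-level)
bleu = sacrebleu.corpus_bleu(hypotheses, [references])

print(f"COMET-22: {comet_score:.4f}  BLEU: {bleu.score:.1f}")
```
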
1 Like

I like the approach. One thing I think is really important is to keep the model size as small as possible. People should be able to run the software on commodity hardware at a decent speed.

2 Likes

Keeping the models as small as possible while improving quality, and finding a generic architecture able to do that so we can concentrate on the content afterwards, is exactly what I have been trying to do for the last few months.

As of now, I have 148M-parameter models that can be trained with the current Locomotive from a not-so-perfect dataset anyone can create using Locomotive (when selecting the data on OPUS, it helps to know both languages well…). They already surpass or come to par with many automatic translators widely used in the industry. Data selection does 60% of the job, the architecture 40%.

Having run two or three trainings simultaneously on more than two datasets for the last 3 months, I have become quite certain of this point. I cannot be sure that bigger models are not architecturally superior, but on my datasets the ones I trained up to 400M parameters brought, in the best-case scenario, only marginal improvement (less than one improved translation every 1000 characters) compared to the 148M ones. They also trained more slowly, some even ran inference more slowly, and the majority gave worse results.

A further improvement pioneered by lynxpda requires modifying both train.py and the CTranslate2 dependency (a PR is in the making by another contributor; a stable version is awaited), and brings us on par with the market leaders. I am following him closely to see where it leads me on the English-German language pair.

Once these improvements have been applied, the models I have built weigh in at 199M parameters and have a shallow decoder, so they're quite fast.

I've also begun characterizing which data can be used in the French-English pair to prove the point. Since the quality is already quite good, I am trying to see how much further it is possible to go.

3 Likes

Yeah, I'd have to agree. In my research I also came to the conclusion that the optimal model size is around 150M parameters, with a deep encoder and a shallow decoder.

In my opinion, this is the optimal configuration:

dec_layers: 18
decoder_type: transformer
enc_layers: 6
encoder_type: transformer
heads: 8
hidden_size: 512
max_relative_positions: -1 # RoPE
model_dtype: fp16
pos_ffn_activation_fn: gated-gelu
position_encoding: false
share_decoder_embeddings: true
share_embeddings: true
share_vocab: true
src_vocab_size: 32000
tgt_vocab_size: 32000
transformer_ff: 3072 # 4608 effective
word_vec_size: 512

As for testing, a little later I will write a script that should automate this process.

1 Like

The configuration looks good, but it's quite difficult to implement (see the post about the en-de model: to enable RoPE on Locomotive, you have to trust your luck a lot :slight_smile:).

People who do not have the knowledge to work around this may appreciate the alternative Shaw configuration with max_relative_positions = 20 or 32, which trains more slowly but should give comparable results.

As for transformer_ff, widening the feed-forward layer from 2048 to 4096 gave the biggest boost to translation accuracy compared with the default or higher values, but I skipped 3072, so I still have to check it.

The gated activation function "silu" may also do the job that "gated-gelu" does: @lynxpda got bad results with it on en-ru, while mine are better than gated-gelu on de-en. I do not know why; it may be related to the language pair (which is why I am now trying to train en-fr with these). The Noam Shazeer preprint we base ourselves on reports only a third-decimal difference between them, so I think both are worth trying.
Note that, for the time being, these gated functions only produce a valid model when the CT2 converter is tweaked and training uses RPE or RoPE.

1 Like

150M seems reasonable to me. Downloading all of the models is a major bottleneck for a lot of LibreTranslate users, but having high-quality translations is also very important.

The models also don't all need to be the same size; we should probably prioritize improving the most used language pairs.

2 Likes

For the rationale behind implementing rotary position encoding, read the following:
https://medium.com/@ngiengkianyew/understanding-rotary-positional-encoding-40635a4d078e
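
For readers who prefer code to prose, here is an illustrative NumPy sketch of the core idea (not the OpenNMT-py implementation; the exact pairing of dimensions differs between implementations):

```python
# Illustrative sketch of rotary position embeddings (RoPE). Each pair of feature
# dimensions of a query/key vector is rotated by an angle that grows with the token
# position, so the dot product q·k ends up depending only on the relative offset.
import numpy as np

def apply_rope(x, theta=10000.0):
    """x: (seq_len, dim) queries or keys, dim must be even."""
    seq_len, dim = x.shape
    inv_freq = 1.0 / theta ** (np.arange(0, dim, 2) / dim)   # one frequency per pair
    angles = np.outer(np.arange(seq_len), inv_freq)          # (seq_len, dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                          # split into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                       # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```
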

I'm trying to get a fix: the error I obtain with onmt 3.5 in the valid.BLEU calculation might come from empty tokens generated by the split(" ") call introduced earlier this year in onmt train, in place of the legacy split(), which dropped empty strings.

From my discussion on GitHub with @vince62s, who is one of the main contributors to the OpenNMT-py and CTranslate2 projects, there is indeed a bug in whitespace tokenization in onmt v3.5.

As for training with RoPE using onmt 3.4.3, I have the following observation: in the absence of a rotary_theta parameter, positional encoding is rendered unambiguously only on contiguous dimensions.

Hence, any token used in both the source and target language with different meanings or constructs may eventually make training diverge.

I had tested "rotary 3.4.3" on both German->English and English->French to rule out the dataset and the Stanza library as causes. Both trainings diverged around the end of the warmup period. All three languages have quite a few tokens in common with different meanings or placement rules.

On more distant language pairs, with close to no such tokens, one may be lucky and training may converge, but I guess it then amounts to relative position encoding.

Having raised the issue with @vince62s, I'll be coming up next Monday with guidelines for a full RoPE implementation in Locomotive. The correct versions are the latest ones (CT2 4.3.1 and ONMT 3.5.1); previous versions, as well as PR#1687, have incomplete specs for this position encoding. Torch 2.1.1+cu121 has to be installed manually prior to the upgrade (the CUDA runtime is not available through requirements.txt).

The tokenization issue in ONMT 3.5 is quite severe and can affect training, so I have to check it on at least 4 trainings first, but I have a working fix for train.py (introducing two transform arguments, valid_transform and train_transform, instead of one), tested on toy models with RoPE, RPE and default position encoding.

These trainings will also allow me to characterize what RoPE brings in practice when translating.

I'll also restore the default values for "src_seq_length" and "tgt_seq_length" (192), which should help with translating longer sentences.
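
In config terms, that simply means putting the defaults back:

src_seq_length: 192
tgt_seq_length: 192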

I think the best strategy is to update the models incrementally, language by language. I recommend starting with the most widely used languages (@pierotofy may have some data on this from LibreTranslate.com). A lot of the most popular language pairs, English-Spanish for example, are still on version 1.0; those are models I trained in 2021 using the OpenNMT-tf default config.

As people train new models, just post on this forum and I can update the index.

1 Like

I currently don't do any type of tracking on the translation API calls (as part of the site's maximum-privacy promise), but most traffic we get on the site is from the US, India, Germany, China, Spain, Brazil and Russia. Not sure if that's a good proxy for picking a language pair to improve.

2 Likes

As for language pairs to prioritize, that list of countries looks like a good guideline.

As for rotary position encoding, I still do not have a settled opinion, because I have only just finished debugging the feature. It turns out that, from onmt-py 3.4.2 onwards, FlashAttention2 is used when training models (except those with "shaw" position encoding). And flash_attn2 does not support legacy GPUs, only recent ones, but no one on the ONMT or CT2 projects seems to have considered that some backwards compatibility would be appreciated…

After several failed trials, and since only Shaw models were still functional after the upgrade, I dived into the specs and code (luckily, rotary is already implemented well enough in onmt 3.4.1 for what Locomotive does) and wrote a fix.

Using the following files, one can produce a "rope" model with the current Locomotive: just rename the extensions to .py and copy them into the respective python3x/Lib/site-packages/ctranslate2/converters/ and ctranslate2/specs/ directories, in place of the existing files.

opennmt_py.py - 320 rope alibi glu (pixeldrain)
transformer_spec.py - 320 rope alibi glu (pixeldrain)

Then apply the configuration features given by @lynxpda and it's a go.
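
If you would rather not hunt for the site-packages path by hand, a small hypothetical helper along these lines does the copy (the source paths are wherever you saved the two downloads; keep a backup of the originals):

```python
# Hypothetical convenience script: overwrite the installed CTranslate2 files with
# the patched versions linked above. Adjust the source paths to your downloads.
import os
import shutil

import ctranslate2

pkg_dir = os.path.dirname(ctranslate2.__file__)
shutil.copy("opennmt_py.py", os.path.join(pkg_dir, "converters", "opennmt_py.py"))
shutil.copy("transformer_spec.py", os.path.join(pkg_dir, "specs", "transformer_spec.py"))
print("Patched files copied into", pkg_dir)
```
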

Having tried the various possible configurations using RoPE, I cannot recommend it.
Most trainings with gated activation functions exhibited vanishing gradients, others aborted on early stopping, and only one went through. Trainings with the default relu activation went through, but the results are inferior to the vanilla DEEP model with basic position encoding.
It does not solve the syntactic issues at all, as a more advanced position encoding should. And lastly, training is not that fast: convergence starts quickly but plateaus afterwards.

Then again, it may well be related to my using second-generation GPUs, so if your GPU is third generation, feel free to upgrade ONMT to 3.5.1, PyTorch to 2.1 and CTranslate2 to 4.3.1, install flash-attn2, modify train.py (to fix a scoring bug in OpenNMT), pull the adapted CTranslate2 patch (for RoPE and gated activations), and prove me wrong. I'll invest in something more recent.
train.py (pixeldrain)
opennmt_py.py - 431mergedPR1687 (pixeldrain)
transformer_spec.py - rope (pixeldrain)
All 3 files (pixeldrain)

But for now, I will either stick to the 148M "DEEP" model, whose results were the most reliable:

enc_layers : 18
dec_layers : 6
transformer_ff : 4096
vocab_size : 32000

or the 161M gated-gelu model with Shaw 20 position encoding ("silu" is almost equivalent, but slightly inferior and has some early-stopping issues):

enc_layers : 18
dec_layers : 6
transformer_ff : 3072
vocab_size : 32000
pos_ffn_activation_fn : gated-gelu
position_encoding : false
max_relative_positions : 20

Both train slowly but steadily and have better evaluation metrics on my training server than the models using rotary position encoding.
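
As a sanity check on those figures, here is a rough parameter count (biases and layer norms ignored, shared embeddings counted once; the factor of 3 for gated activations accounts for the extra feed-forward projection):

```python
# Rough parameter count for the transformer configurations above, ignoring biases
# and layer norms. Shared src/tgt/output embeddings are counted once. Gated
# activations (gated-gelu, silu) add a third feed-forward projection.

def approx_params(enc_layers, dec_layers, d_model, d_ff, vocab, gated=False):
    emb = vocab * d_model                        # shared embedding matrix
    attn = 4 * d_model * d_model                 # Q, K, V and output projections
    ffn = (3 if gated else 2) * d_model * d_ff   # feed-forward block
    enc_layer = attn + ffn                       # self-attention + FFN
    dec_layer = 2 * attn + ffn                   # + cross-attention
    return emb + enc_layers * enc_layer + dec_layers * dec_layer

print(approx_params(18, 6, 512, 4096, 32000) / 1e6)               # ~148.5 (DEEP, relu)
print(approx_params(18, 6, 512, 3072, 32000, gated=True) / 1e6)   # ~161.1 (gated-gelu, Shaw 20)
```

This is also why the earlier configuration notes "4608 effective" next to transformer_ff: 3072 with a gated activation has the same parameter count as a plain feed-forward of width 4608.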

1 Like

After also training on English-French pairs, I have come to the conclusion that the training data really is the dominant factor in a model's quality.

I tried first on a mix of excerpts from CCMatrix, UNPC, Multi-UN, EuroPat, TEDs and DGT (EU), to which I added the Canadian Hansards and Cadlaws. Totalling 55M selected sentence pairs, I got slightly better results than the existing models, but nothing really exciting: the improvement was due to the hyperparameters.

Then I reduced the dataset to 25M sentence pairs, keeping only the best excerpts, and got a small jump of 0.4 COMET points. Nothing too noticeable in a manual evaluation.

I then added 20M backtranslated sentences from the Leipzig University Wortschatz corpora. Scores went back to the values I had with the 55M sentence pairs. So much for backtranslation.

Incidentally, I was asked to look for an LLM-backed translator and came across TowerInstruct. The team that developed it explicitly used the reference-free wmt22-cometkiwi-da model to filter their training data. They reach a 0.8824/0.884 COMET score after training with only 2M sentence pairs per language. For comparison, Google is at 0.8922/0.8992, LT 1.9 at 0.863/0.882, and I barely reach 0.873 towards French and 0.890 towards English.

So I devised a script to calculate COMET scores and sort sentence pairs by them (~10k a minute, a little more than 10 million a day). It is not a filter yet, since I first have to find out how the scores are distributed across the various corpora, but a good threshold for a filter might be between 0.85 and 0.89, depending on how selective one wants to be.

The goal is to get the best 25M sentence pairs COMET-wise and train models on them. For now the script is single-GPU only, but the code can be optimized for multiple GPUs.
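
For those who want to reproduce it, the gist of the script looks something like this (a sketch assuming the unbabel-comet package and a tab-separated source/target file; wmt22-cometkiwi-da is gated on Hugging Face, so you need to accept its license and log in first):

```python
# Sketch: score parallel sentence pairs with the reference-free wmt22-cometkiwi-da
# model and write them out sorted by score (highest first). Single GPU, as described.
from comet import download_model, load_from_checkpoint

pairs = []
with open("corpus.tsv", encoding="utf-8") as f:        # one "source<TAB>target" pair per line
    for line in f:
        src, tgt = line.rstrip("\n").split("\t", 1)
        pairs.append({"src": src, "mt": tgt})           # reference-free: no "ref" field

model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))
scores = model.predict(pairs, batch_size=64, gpus=1).scores

ranked = sorted(zip(scores, pairs), key=lambda t: t[0], reverse=True)
with open("corpus_sorted.tsv", "w", encoding="utf-8") as out:
    for score, pair in ranked:
        out.write(f"{score:.4f}\t{pair['src']}\t{pair['mt']}\n")
```
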

1 Like