The problem of translation quality and its solution

I would like to use this topic to discuss a plan for solving the translation quality problem.
As you know, many of the standard models with 65M parameters have poor translation quality, for a number of reasons.
I propose the following plan:

  1. Review existing models and compare their performance with SOTA, or at least with Google Translate. This can be done automatically using COMET-22 and, optionally, SacreBLEU.
  2. Make a list of languages whose models need to be updated (re-trained).
  3. Agree on the maximum allowable size of the model and on its architecture, so that we end up with a generic model that gives the best quality relative to its size.
  4. Train the base models EN_XXX and XXX_EN.
  5. Create a script to transfer the training from base models to target models.
  6. Train the new models, using transfer from the base models to reduce training time and cost.
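
Steps 1 and 2 could be automated along these lines. This is only a sketch: the function name `pairs_to_retrain` and the default `margin` are hypothetical, and the scores would come from whatever COMET-22 (or SacreBLEU) run is agreed on.

```python
from typing import Dict, List, Tuple


def pairs_to_retrain(
    current: Dict[str, float],
    reference: Dict[str, float],
    margin: float = 0.02,
) -> List[Tuple[str, float]]:
    """Return (pair, gap) for pairs trailing the reference by more than margin.

    Both dicts map a language pair such as "en-es" to a quality score where
    higher is better (for example a COMET-22 system score). Pairs missing
    from either dict are skipped. The result is sorted by gap, largest first.
    """
    gaps = []
    for pair, score in current.items():
        if pair not in reference:
            continue  # nothing to compare against
        gap = reference[pair] - score
        if gap > margin:
            gaps.append((pair, gap))
    return sorted(gaps, key=lambda item: item[1], reverse=True)
```

Calling it with the scores of the current models and of the chosen reference system gives the list for step 2, with the worst pairs first.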

I like the approach. One thing I think is really important is to keep the model size as small as possible. People should be able to run the software on commodity hardware and at a decent speed.


Keeping the models as small as possible while improving quality, and finding a generic architecture that can do this so that we can concentrate on the content afterwards, is exactly what I have been trying to do these last few months.

As of now, I have models with 148M parameters that can be trained on the current Locomotive from a less-than-perfect dataset that anyone can create using Locomotive (when selecting the data on OPUS, it helps to know both languages well…). They already surpass or are on par with many automatic translators widely used in the industry. Data selection does 60% of the job, the architecture 40%.

Having run two or three trainings simultaneously on more than two datasets for the last 3 months, I have become quite certain of this point. I cannot be sure that bigger models are not superior in terms of architecture, but on my datasets, the ones I trained with up to 400M parameters brought, in the best case, only a marginal improvement (less than one improved translation every 1000 characters) over the 148M ones. They also trained more slowly, some even inferred more slowly, and the majority gave worse results.

A further improvement pioneered by lynxpda requires modifying both and the CTranslate2 dependency (PR in the making by another contributor, stable version awaited), and brings us on par with market leaders. I am following him closely to see where it leads me on the English-German language pair.

Once these improvements have been applied, the models I synthesized weigh 199M and have a shallow decoder, so they’re quite fast.

Also, I’ve begun characterizing which data can be used in the French-English pair to prove the point. Since the quality is already quite good, I am trying to see how much further it can be improved.


Yeah, I’d have to agree. In my research I also came to the conclusion that the optimal model size is around 150M parameters, with a deep encoder and a shallow decoder.

In my opinion, this is the optimal configuration:

dec_layers: 6
decoder_type: transformer
enc_layers: 18
encoder_type: transformer
heads: 8
hidden_size: 512
max_relative_positions: -1 # RoPE
model_dtype: fp16
pos_ffn_activation_fn: gated-gelu
position_encoding: false
share_decoder_embeddings: true
share_embeddings: true
share_vocab: true
src_vocab_size: 32000
tgt_vocab_size: 32000
transformer_ff: 3072 # 4608 effective
word_vec_size: 512
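
A rough parameter count helps to check how such a configuration compares with the 150M target. The sketch below is an estimate under stated assumptions, not what OpenNMT-py reports: it ignores biases and layer norms, assumes the gated feed-forward block uses three weight matrices instead of two (which would explain the "4608 effective" comment, since 3 × 3072 = 2 × 4608), and counts the shared embeddings once. The function name `estimate_params` is made up for this example.

```python
def estimate_params(
    enc_layers: int,
    dec_layers: int,
    hidden_size: int = 512,
    transformer_ff: int = 3072,
    vocab_size: int = 32000,
    gated_ffn: bool = True,
) -> int:
    """Approximate weight count of an encoder-decoder transformer.

    Assumptions: no biases or layer norms, embeddings shared between
    source, target and output layer (counted once), and no learned
    position embeddings (RoPE adds no parameters).
    """
    attention = 4 * hidden_size * hidden_size  # Q, K, V and output projections
    ffn_matrices = 3 if gated_ffn else 2       # gated FFN has one extra matrix
    ffn = ffn_matrices * hidden_size * transformer_ff
    encoder_layer = attention + ffn            # self-attention + FFN
    decoder_layer = 2 * attention + ffn        # self- and cross-attention + FFN
    embeddings = vocab_size * hidden_size
    return enc_layers * encoder_layer + dec_layers * decoder_layer + embeddings


deep_encoder = estimate_params(enc_layers=18, dec_layers=6)
deep_decoder = estimate_params(enc_layers=6, dec_layers=18)
print(f"18 encoder / 6 decoder layers: {deep_encoder / 1e6:.1f}M")
print(f"6 encoder / 18 decoder layers: {deep_decoder / 1e6:.1f}M")
```

Under these assumptions, 18 encoder layers with 6 decoder layers come to roughly 161M weights, while the reverse split comes to roughly 174M, because each decoder layer also carries a cross-attention block.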

As for testing, a little later I will write a script that should automate this process.


The configuration looks good, but it’s quite difficult to implement (see the post about the en-de model: to enable RoPE on Locomotive, you have to trust your luck a lot :slight_smile:).

People who do not have the knowledge to work around this may appreciate the alternative Shaw configuration with max_relative_positions = 20 or 32, which trains more slowly but should give comparable results.
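
For context, in the Shaw-style scheme max_relative_positions is the clipping distance: all tokens further apart than that distance share the same learned embedding per direction. Here is a minimal sketch of the index computation, following the relative position formulation of Shaw et al. rather than OpenNMT-py's actual code (the function name is hypothetical):

```python
from typing import List


def relative_position_buckets(
    length: int, max_relative_positions: int
) -> List[List[int]]:
    """Map each (query i, key j) pair to an embedding index in [0, 2k].

    The signed offset j - i is clipped to [-k, k] and shifted by k, so all
    tokens further away than k positions share one index per direction.
    """
    k = max_relative_positions
    return [
        [max(-k, min(k, j - i)) + k for j in range(length)]
        for i in range(length)
    ]


# Tiny example with k = 2 so that the clipping is visible on 5 tokens.
for row in relative_position_buckets(5, 2):
    print(row)
```

With a value of 20 or 32, the model therefore learns 41 or 65 distinct relative position embeddings, whatever the sentence length.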

For transformer_ff, widening the feed-forward layer from 2048 to 4096 gave the biggest boost to translation accuracy compared with the default or higher values, but I skipped 3072, so I still have to check it.

The gated activation function ‘silu’ may also do the job that ‘gated-gelu’ does. @lynxpda got bad results with it on en-ru, while mine on de-en are better than with gated-gelu. I do not know why; it may be related to the language pair (which is why I am now training en-fr with these). The Noam Shazeer preprint we rely on mentions a difference in the third decimal between them, so I think both are worth trying.
Note that for the time being, those gated functions only give a valid model when tweaking the CT2 converter and training with RPE or RoPE.


150M seems reasonable to me. Downloading all of the models is a major bottleneck for a lot of LibreTranslate users but having high quality translations is also very important.

The models also don’t all need to be the same size; we should probably prioritize improving the most used language pairs.


For the rationale of implementing rotary position encoding, read the following.

I’m trying to get a fix. The error I get with onmt 3.5 during the valid.BLEU calculation might come from empty tokens generated by the split(" ") call introduced earlier this year in onmt train in place of the legacy split(), which dropped empty strings.

From my discussion on GitHub with @vince62s, who’s one of the main contributors to the OpenNMT-py and CTranslate2 projects, there is indeed a bug in whitespace tokenization in onmt v3.5.
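
The difference between the two calls is easy to reproduce in plain Python. The sentence below is a made-up example, and this only illustrates the behaviour of str.split, not the exact code path in onmt:

```python
line = "Hello  world "  # note the double space and the trailing space

legacy = line.split()     # splits on runs of whitespace, never yields ""
strict = line.split(" ")  # splits on every single space, keeps "" tokens

print(legacy)
print(strict)

# Dropping the empty strings gives the same tokens as the legacy call
# for this input (which contains spaces only, no tabs or newlines).
filtered = [token for token in strict if token]
```

Any double or trailing space in a tokenized line therefore turns into an empty token with split(" "), which is the kind of input a scorer may not expect.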

As for training with RoPE using onmt 3.4.3, I have the following observation: in the absence of the rotary_theta parameter, positional encoding is rendered unambiguously only on contiguous dimensions.

Hence, any token used in both the source and target language with different meanings or constructs may make training diverge.

I tested “rotary 3.4.3” on both German->English and English->French to rule out the dataset and the stanza library as causes. Both trainings diverged around the end of the warm-up period. All three languages have quite a few tokens in common with different meanings or placement rules.

On more distant language pairs, with close to no such tokens, one may be lucky and training may converge, but it will amount to relative positional encoding, I guess.
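
For readers unfamiliar with rotary position encoding, here is a minimal pure-Python sketch of the standard formulation, with adjacent dimension pairs rotated by position-dependent angles and a base of 10000. It is not the OpenNMT-py or CTranslate2 implementation, the `theta` argument is just this sketch's name for the base, and implementations differ in which dimensions they pair together. It only illustrates why attention scores under RoPE depend on the relative offset between tokens.

```python
import math
from typing import List


def rope(vector: List[float], position: int, theta: float = 10000.0) -> List[float]:
    """Rotate each pair (x[i], x[i+1]), i even, by position * theta^(-i/d)."""
    d = len(vector)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, d, 2):
        angle = position * theta ** (-i / d)  # lower frequency for later pairs
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Arbitrary query and key vectors, for illustration only.
q = [0.5, -1.0, 0.25, 2.0]
k = [1.5, 0.5, -0.75, 1.0]

score_near = dot(rope(q, 3), rope(k, 1))     # positions 3 and 1
score_far = dot(rope(q, 103), rope(k, 101))  # same offset, shifted by 100
print(abs(score_near - score_far) < 1e-9)
```

The two scores match up to floating-point error because rotating both vectors by the same extra angle leaves their dot product unchanged.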

Having raised the issue with @vince62s, I’ll come back next Monday with guidelines for a full RoPE implementation in Locomotive. The correct versions are the latest ones (CT2 4.3.1 and ONMT 3.5.1); previous versions, as well as PR#1687, have incomplete specs for this position encoding. Torch 2.1.1+cu121 has to be installed manually prior to the upgrade (the CUDA runtime is not available through requirements.txt).

The tokenization issue in ONMT 3.5 is quite severe and can strike during training, so I have to check it on at least 4 trainings first, but I have a working fix for (introduce 2 transform arguments, valid_transform and train_transform, instead of 1), tested on toy models with RoPE, RPE and default position encoding.

These trainings will also make it possible to characterize what RoPE brings in practice when translating.

I’ll restore the default values for “src_seq_length” and “tgt_seq_length” (192) too, which should help with translating longer sentences.

I think the best strategy is to update the models incrementally, language by language. I recommend starting with the most widely used languages (@pierotofy may have some data on this). A lot of the most popular language pairs, English-Spanish for example, are still on version 1.0, which are models I trained in 2021 using the OpenNMT-tf default config.

As people train new models, just post on this forum and I can update the index.


I currently don’t do any type of tracking on the translation API calls (as part of the site’s maximum privacy promise), but most of the traffic we get on the site is from the US, India, Germany, China, Spain, Brazil and Russia. Not sure if that’s a good proxy for picking a language pair to improve.