Help Wanted: Improve en-de translation

I trained with Locomotive and tested two models, version 1.0.6, with the same parameters as the DEEP ru_en/en_ru models (vocab 32000, feed_forward 4096, 20 encoder layers, batch size 8192, accum 25; each model is 173 MB). Only 20k train_steps, but the progression is smooth and DE-EN learns very little in the last 10k steps.
EN-DE still has some learning margin, though.
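For reference, here is roughly what those hyperparameters look like in OpenNMT-py option names (which Locomotive drives under the hood). This is a hedged sketch, not the actual generated config; the exact keys Locomotive emits may differ:

```yaml
# Sketch of the hyperparameters above, in OpenNMT-py option names
src_vocab_size: 32000
tgt_vocab_size: 32000
transformer_ff: 4096
enc_layers: 20
batch_size: 8192
batch_type: tokens
accum_count: [25]
train_steps: 20000
```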

Sources (opus)
All of ELRC-German_Foreign_Offic; CCMatrix (top 20%); OpenSubtitles (top 70% by weight); EuroPat (top 70% by weight); DGT; EuroParl; EUbookshop

de_en: ppl 9.1881, BLEU 60.64265
en_de: ppl 10.2332, BLEU 46.43329

If you need a version number with only two digits (1.1), I am currently trying to improve on this using the excerpt filter on CCMatrix.


Don’t worry about the version number, I can change it to “1.9” once you have your final model trained.

The models look good! Here’s some text I ran through them:

English Source Text

In the preface to my translation of the “Iliad” I have given my views as to the main principles by which a translator should be guided, and need not repeat them here, beyond pointing out that the initial liberty of translating poetry into prose involves the continual taking of more or less liberty throughout the translation; for much that is right in poetry is wrong in prose, and the exigencies of readable prose are the first things to be considered in a prose translation. That the reader, however, may see how far I have departed from strict construe, I will print here Messrs. Butcher and Lang’s translation of the sixty lines or so of the “Odyssey.” Their translation runs:

- Butler, Translation Preface, Homer’s Odyssey, Project Gutenberg

German Translation (1.0.6)

Im Vorwort zu meiner Übersetzung der “Ilias” habe ich meine Ansichten zu den Hauptprinzipien gegeben, nach denen ein Übersetzer geführt werden sollte, und sie müssen sie hier nicht wiederholen, außer darauf hinzuweisen, dass die anfängliche Freiheit, Poesie in Prosa zu übersetzen, das ständige Nehmen von mehr oder weniger Freiheit während der Übersetzung beinhaltet; denn vieles, was in Poesie richtig ist, ist in Prosa falsch, und die Notwendigkeiten lesbarer Prosa sind die ersten Dinge, die in einer Prosaübersetzung berücksichtigt werden. Daß der Leser jedoch sehen mag, wie weit ich von der strengen Auslegung abgewichen bin, werde ich hier die Herren drucken. Butcher und Langs Übersetzung der sechzig Zeilen oder so der “Odyssee”. Ihre Übersetzung läuft:

English Back Translation (1.0.6)

In the preface to my translation of the “Iliad”, I have given my views on the main principles by which a translator should be guided, and they do not have to repeat them here, except to point out that the initial freedom to translate poetry into prose involves the constant taking of more or less freedom during translation; For much of what is right in poetry is wrong in prose, and the necessities of legible prose are the first things to be considered in a prose translation. However, that the reader may see how far I have departed from the strict interpretation, I will print the gentlemen here. Butcher and Lang’s translation of the sixty lines or so the “Odyssey”. Your translation is ongoing:

They are pretty good, but if I apply all the tricks from my discussions with lynxpda, they could be even better, so I am giving it a try.


Two weeks later, I haven’t succeeded in improving the model. I spent the last three days running training runs from the raw dataset, and realized there is a fairly large amount of entropy involved in the final result.
Right now, I am still trying to determine which parameters would distinguish, within a few hours, a dataset that won’t yield a good model from one that will.
I’ll publish a post next week with what I found out.


I wanted to see what was at play in translation quality…

Curating the data
I have finally been training models on 25M sentences selected from what would be the “crème de la crème” in Opus for German-English:

  • excerpts of CCMatrix (0-4% & 28-32%) and EuroPat (40-50%),
  • DGT, EuroParl,
  • anything from the Federal German Government,
  • QED, TED2020/2013, Global Voices, CORDIS.
    All this is filtered on sentence length (depending on the corpus, I filter out sentences under 20 or 30 characters to eliminate titles, and over 500 or 1000 characters to eliminate overly long sentences), num and non_alphanum ratio (over 0.2 or 0.3), source-to-target length ratio [0.6; 1.7], and digits_mismatch. I also eliminated orphan quotes and brackets.
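The filtering rules above can be sketched as a simple pair-level predicate. This is a minimal illustration, not Locomotive’s actual filter code: the thresholds are the ones mentioned above, and the combined digit/non-alphanumeric ratio is my own simplification of the separate num and non_alphanum checks.

```python
import re

def keep_pair(src, tgt, min_len=20, max_len=500, max_nonalpha=0.3,
              ratio_lo=0.6, ratio_hi=1.7):
    """Heuristic sentence-pair filter mirroring the rules described above.

    Thresholds vary per corpus in practice (20/30 chars, 500/1000 chars,
    0.2/0.3 ratio); defaults here are illustrative.
    """
    for s in (src, tgt):
        # sentence-length filter: drop titles and overly long sentences
        if not (min_len <= len(s) <= max_len):
            return False
        # combined digit / non-alphanumeric ratio (spaces excluded)
        junk = sum(1 for c in s if not (c.isalpha() or c == " "))
        if junk / len(s) > max_nonalpha:
            return False
        # orphan quotes or unbalanced brackets
        if s.count('"') % 2 or s.count("(") != s.count(")"):
            return False
    # source-to-target length ratio in [0.6; 1.7]
    if not (ratio_lo <= len(src) / len(tgt) <= ratio_hi):
        return False
    # digits_mismatch: both sides must contain the same digits
    if sorted(re.findall(r"\d", src)) != sorted(re.findall(r"\d", tgt)):
        return False
    return True
```

A pair like “The committee approved the budget in 2021.” / “Der Ausschuss genehmigte den Haushalt 2021.” passes, while a pair whose numbers disagree is dropped by the digits_mismatch check.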

These data bring the ppl to 10.18 and the COMET score to 0.88 for BASE models; dabbling with hyperparameters brings further improvement: ppl goes down to about 8.6. Training ppl is around 11, less with deep encoders.

As for BLEU, it oscillates between 16-18 and 39-45 on flores200-devtest (quite inconsistently at that), and remains above 50 on flores200-dev (with a max at 69 for a model that uses custom label_smoothing, whose translations exhibit interesting qualities although COMET ranks it slightly below the same models with default label_smoothing).
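Since BLEU numbers are being compared across dev sets, it helps to recall what corpus BLEU actually computes. Below is a minimal stdlib sketch of the metric (corpus-level clipped n-gram precisions, geometric mean, brevity penalty); real evaluations should use a standard implementation such as sacreBLEU, which also handles tokenization and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Unsmoothed corpus BLEU (0-100) for whitespace-tokenized text."""
    p_num = [0] * max_n
    p_den = [0] * max_n
    hyp_len = ref_len = 0
    for h, r in zip(hyps, refs):
        h, r = h.split(), r.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            # clipped n-gram matches, summed over the whole corpus
            p_num[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            p_den[n - 1] += sum(hc.values())
    if min(p_num) == 0 or min(p_den) == 0:
        return 0.0
    log_p = sum(math.log(n / d) for n, d in zip(p_num, p_den)) / max_n
    # brevity penalty: punish hypotheses shorter than the references
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_p)
```

An identical hypothesis/reference corpus scores 100; any n-gram mismatch pulls the geometric mean down, which is part of why BLEU swings so much on small, stylistically varied dev sets.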

I started curating the data for English to German; it looks like the corpora are treated quite differently, but I will apply the same method and see what comes out of it.


Another important point is which LR scheduler you use and with what settings.
Experiments revealed that this is one of the most important parameters relative to the global batch size.

I use the basic LR scheduler in Locomotive, with the basic settings for learning rate, warmup steps, and dropout.

I see the models learning fast enough, so I didn’t find it useful to try anything else.

Regarding LR, you may have noticed that over time the model reaches a plateau and stops improving. By adjusting the LR you can knock it out of the local minimum and squeeze even more out of it.
You can also speed up learning.


OK, I actually noticed that after the warmup steps, val.BLEU progresses much more slowly and starts oscillating.
Do you mean we should stop training then and resume with a higher learning rate, or push the learning rate up and add warmup steps?
I read in a paper by the University of Prague that learning diverges with LR above 0.25, so I figured 0.15 was a good compromise.

Yes, there are several LR scheduler strategies; one of them is cyclic changes in LR that allow you to knock the model out of a local minimum. By default, Locomotive used to have a noam scheduler; depending on the warm-up steps (for example, 16000), I was quite comfortable setting LR equal to 3. Here you need to look at the minimum and maximum actual learning rates during training relative to the total number of training steps and the effective batch size. A too-low maximum LR can also be bad and not lead to optimal model parameters.
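To make the “LR = 3 with warmup 16000” numbers concrete: with the noam schedule, the configured learning_rate is only a scale factor, and the actual rate depends on d_model and the step count. A minimal sketch (assuming d_model = 1024; the real model dimension may differ):

```python
def noam_lr(step, d_model=1024, warmup=16000, scale=3.0):
    """Noam schedule: linear warmup, then inverse-square-root decay.

    `scale` plays the role of the configured learning_rate (e.g. 3);
    the actual rate peaks at step == warmup and decays afterwards.
    """
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

With these assumed defaults the actual rate peaks around 7.4e-4 at step 16000, which is why a nominal “LR = 3” does not diverge: the divergence threshold of ~0.25 quoted above refers to the actual rate, not the noam scale factor.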


Regarding pre-training, transfer learning, and correct selection of the LR scheduler, below is the training log of the new model (Finnish, Estonian, Vepsian to Russian) based on an already existing model (English to Russian).
Considering the model size of 450M parameters and training on 2x RTX 3060, I got a good result. At the same time, the speed can be increased approximately 4x (by freezing the decoder in this case and reducing the effective batch size):

[2024-05-16 17:09:02,586 INFO] Step 50/20000; acc: 31.9; ppl: 200.1; xent: 5.3; lr: 0.00004; sents:  234020; bsz: 1323/1215/52; 6795/6242 tok/s;    876 sec;
[2024-05-16 17:23:28,696 INFO] Step 100/20000; acc: 36.5; ppl: 132.3; xent: 4.9; lr: 0.00007; sents:  243673; bsz: 1315/1222/54; 6833/6347 tok/s;   1742 sec;
[2024-05-16 17:37:51,586 INFO] Step 150/20000; acc: 41.7; ppl:  91.7; xent: 4.5; lr: 0.00011; sents:  236869; bsz: 1322/1216/53; 6894/6344 tok/s;   2605 sec;
[2024-05-16 17:52:16,419 INFO] Step 200/20000; acc: 50.2; ppl:  51.8; xent: 3.9; lr: 0.00014; sents:  241859; bsz: 1322/1221/54; 6877/6352 tok/s;   3470 sec;
[2024-05-16 18:06:40,540 INFO] Step 250/20000; acc: 55.9; ppl:  36.9; xent: 3.6; lr: 0.00018; sents:  235698; bsz: 1318/1220/52; 6864/6356 tok/s;   4334 sec;
[2024-05-16 18:21:05,068 INFO] Step 300/20000; acc: 58.4; ppl:  32.0; xent: 3.5; lr: 0.00021; sents:  235065; bsz: 1322/1218/52; 6883/6340 tok/s;   5199 sec;
[2024-05-16 18:35:30,901 INFO] Step 350/20000; acc: 60.6; ppl:  28.3; xent: 3.3; lr: 0.00025; sents:  239441; bsz: 1319/1219/53; 6854/6337 tok/s;   6064 sec;
[2024-05-16 18:49:48,189 INFO] Step 400/20000; acc: 62.0; ppl:  26.1; xent: 3.3; lr: 0.00028; sents:  241219; bsz: 1314/1221/54; 6896/6412 tok/s;   6922 sec;
[2024-05-16 19:03:59,091 INFO] Step 450/20000; acc: 63.1; ppl:  24.4; xent: 3.2; lr: 0.00032; sents:  236453; bsz: 1322/1221/53; 6990/6458 tok/s;   7773 sec;
[2024-05-16 19:18:09,557 INFO] Step 500/20000; acc: 64.0; ppl:  23.0; xent: 3.1; lr: 0.00035; sents:  239313; bsz: 1316/1221/53; 6963/6463 tok/s;   8623 sec;


OK, so this is why you also implemented prefixes and special vocab features… a trilingual model.

Regarding training convergence, which learning rate and warmup steps are used in each curve?

Model Apr-26 (black) is the basic model with 450M parameters, trained with noam with warm-up 16000 and LR = 3.
Model May-16 (blue) is a trilingual model based on the base Apr-26 model, but with a reset optimizer and updated vocab: noam with a short warm-up of 1000 steps and LR = 0.5. The calculator for the LR scheduler was posted in the next topic:

This is the same case of short training with transfer of learning from a previously pre-trained model.