Help Wanted: Improve en-de translation

Trained and tested two models on Locomotive, version 1.0.6, with the same parameters as the DEEP ru_en/en_ru models (vocab 32000, feed_forward 4096, 20 encoder layers, batch size 8192, accum 25; each model is 173MB). Only 20k train_steps, but the progression is smooth and DE-EN learns very little over the last 10k steps.
There is still some learning margin on EN_DE, though.
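
For readers less used to the OpenNMT-py option names that Locomotive wraps, here is a rough, non-authoritative mapping of those numbers onto the usual options; Locomotive’s own config.json keys may differ, and only the values mentioned above are filled in:

```python
# Rough OpenNMT-py equivalents of the settings above (illustrative only;
# Locomotive's config.json keys may differ).
opts = {
    "src_vocab_size": 32000,
    "tgt_vocab_size": 32000,
    "transformer_ff": 4096,   # feed-forward hidden size
    "enc_layers": 20,
    "batch_size": 8192,       # tokens per batch
    "batch_type": "tokens",
    "accum_count": 25,        # effective batch ~ 8192 * 25 tokens per update
    "train_steps": 20000,
}
```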

Sources (OPUS)
All of ELRC-German_Foreign_Offic; CCMatrix (top 20%); OpenSubtitles (top 70% by weight); EuroPat (top 70% by weight); DGT; EuroParl; EUbookshop

de_en: ppl 9.1881, BLEU 60.64265 (eval.py)
en_de: ppl 10.2332, BLEU 46.43329

If you need a version with only 2 digits (1.1), I am currently trying to improve on this using the excerpt filter on CCMatrix.

3 Likes

Don’t worry about the version number; I can change it to “1.9” once you have your final model trained.

The models look good! Here’s some text I ran through them:

English Source Text

In the preface to my translation of the “Iliad” I have given my views as to the main principles by which a translator should be guided, and need not repeat them here, beyond pointing out that the initial liberty of translating poetry into prose involves the continual taking of more or less liberty throughout the translation; for much that is right in poetry is wrong in prose, and the exigencies of readable prose are the first things to be considered in a prose translation. That the reader, however, may see how far I have departed from strict construe, I will print here Messrs. Butcher and Lang’s translation of the sixty lines or so of the “Odyssey.” Their translation runs:

- Butler’s preface to his translation of Homer’s Odyssey, Project Gutenberg

German Translation (1.0.6)

Im Vorwort zu meiner Übersetzung der “Ilias” habe ich meine Ansichten zu den Hauptprinzipien gegeben, nach denen ein Übersetzer geführt werden sollte, und sie müssen sie hier nicht wiederholen, außer darauf hinzuweisen, dass die anfängliche Freiheit, Poesie in Prosa zu übersetzen, das ständige Nehmen von mehr oder weniger Freiheit während der Übersetzung beinhaltet; denn vieles, was in Poesie richtig ist, ist in Prosa falsch, und die Notwendigkeiten lesbarer Prosa sind die ersten Dinge, die in einer Prosaübersetzung berücksichtigt werden. Daß der Leser jedoch sehen mag, wie weit ich von der strengen Auslegung abgewichen bin, werde ich hier die Herren drucken. Butcher und Langs Übersetzung der sechzig Zeilen oder so der “Odyssee”. Ihre Übersetzung läuft:

English Back Translation (1.0.6)

In the preface to my translation of the “Iliad”, I have given my views on the main principles by which a translator should be guided, and they do not have to repeat them here, except to point out that the initial freedom to translate poetry into prose involves the constant taking of more or less freedom during translation; For much of what is right in poetry is wrong in prose, and the necessities of legible prose are the first things to be considered in a prose translation. However, that the reader may see how far I have departed from the strict interpretation, I will print the gentlemen here. Butcher and Lang’s translation of the sixty lines or so the “Odyssey”. Your translation is ongoing:

They are pretty good, but if I apply all the tricks from my discussions with lynxpda, they could be even better, so I am giving it a try.

2 Likes

Two weeks later, I haven’t succeeded in improving the model. I spent the last three days running training runs from the raw dataset, and realized there is a pretty fair amount of entropy involved in the final result.
Right now, I am still trying to determine which parameters would let me tell, within a few hours, a dataset that won’t yield a good model from one that will.
I’ll publish a post next week with what I found out.

3 Likes

I wanted to see what was at play in translation quality…

Curating the data
I have finally been training models on 25M sentences selected from what would be the “crème de la crème” of OPUS for German-English:

  • excerpts of CCMatrix (0-4% & 28-32%) and EuroPat (40-50%),
  • DGT, EuroParl,
  • anything from the Federal German Government,
  • QED, TED2020/2013, Global Voices, CORDIS.
    All this is filtered on sentence length (depending on the corpus, I filter out sentences under 20 or 30 characters to eliminate titles, and over 500 or 1000 characters to eliminate overly long sentences), numeric and non-alphanumeric ratio (over 0.2 or 0.3), source-to-target length ratio [0.6; 1.7], and digits mismatch. Orphan quotes and brackets are also eliminated (a sketch of these filters follows below).
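
This is not Locomotive’s actual filtering code, just a minimal sketch of the heuristics listed above, assuming plain-text parallel lines; the thresholds and the helper name are illustrative.

```python
import re

def keep_pair(src, tgt, min_len=30, max_len=1000,
              max_nonalnum_ratio=0.3, ratio_bounds=(0.6, 1.7)):
    """Illustrative version of the filters described above."""
    for s in (src, tgt):
        # length bounds: drop titles and overly long sentences
        if not (min_len <= len(s) <= max_len):
            return False
        # too high a ratio of digits / punctuation / symbols
        nonalnum = sum(1 for c in s if not (c.isalnum() or c.isspace()))
        if nonalnum / len(s) > max_nonalnum_ratio:
            return False
        # orphan quotes and brackets
        if s.count('"') % 2 or s.count("(") != s.count(")"):
            return False
    # source-to-target length ratio
    if not (ratio_bounds[0] <= len(src) / len(tgt) <= ratio_bounds[1]):
        return False
    # digits mismatch: both sides must carry the same numbers
    return sorted(re.findall(r"\d+", src)) == sorted(re.findall(r"\d+", tgt))
```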

These data bring the ppl to 10.18 and the COMET score to 0.88 for BASE models; dabbling with hyperparameters brings further improvement: ppl goes down to 8.6-something… training ppl is around 11, less with deep encoders.

As for BLEU, it oscillates between 16-18 and 39-45 on flores200-devtest (quite inconsistently at that), and remains above 50 on flores200-dev (with a maximum of 69 for a model that uses custom label_smoothing…, whose translations exhibit interesting qualities, although COMET ranks it slightly below the same model with default label_smoothing).
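
For reference, this is roughly how such scores can be computed outside of Locomotive’s eval.py, with sacrebleu and the Unbabel COMET package; a sketch only, where the COMET checkpoint name and the use of detokenized line lists are assumptions, and package versions may change the API slightly.

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

srcs = ["Das ist ein kleiner Test."]   # flores200 source side
refs = ["This is a small test."]       # reference translations
hyps = ["This is a small test."]       # model output

# Corpus BLEU (sacrebleu expects a list of reference streams)
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU: {bleu.score:.2f}")

# Reference-based COMET score
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print("COMET:", model.predict(data, batch_size=8, gpus=0).system_score)
```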

I started curating the data for English to German; it looks like the corpora are treated quite differently, but I will apply the same method and see what comes out of it.

1 Like

Another important point is which LR scheduler you use and with what settings.
Experiments revealed that this is one of the most important parameters in relation to the global batch size.

The basic LR scheduler in Locomotive, with the basic settings for learning rate, warmup steps and dropout.

I see the models learning fast enough, so I didn’t find it useful to try anything else.

Regarding LR, you may have noticed that over time the model reaches a plateau and stops improving. By adjusting the LR you can knock it out of the local minimum and squeeze even more out of it.
You can also speed up training.

1 Like

OK, I actually noticed that after the warmup steps, val.BLEU progresses much more slowly and starts oscillating.
Do you mean we should stop training then and resume with a higher learning rate, or raise the learning rate and add warmup steps?
I read in a paper by the University of Prague that learning diverges with LR above 0.25, so I figured 0.15 was a good compromise.

Yes, there are several LR scheduler strategies; one of them is cyclic changes in LR, which let you knock the model out of a local minimum. By default, Locomotive used to have a noam scheduler; depending on the warm-up steps (for example, 16000), I was quite comfortable setting LR equal to 3. Here you need to look at the minimum and maximum actual learning rates over the training run, relative to the total number of training steps and the effective batch size. Too low a maximum LR can also be bad and fail to lead to optimal model parameters.

1 Like

Regarding pre-training, transfer learning and correct selection of the LR scheduler, below is the training log of a new model (Finnish, Estonian, Vepsian to Russian) based on an already existing model (English to Russian).
Considering the model size of 450M parameters and training on 2x RTX 3060, I got a good result. The speed could also be increased approximately 4 times (by freezing the decoder in this case and reducing the effective batch size):

[2024-05-16 17:09:02,586 INFO] Step 50/20000; acc: 31.9; ppl: 200.1; xent: 5.3; lr: 0.00004; sents:  234020; bsz: 1323/1215/52; 6795/6242 tok/s;    876 sec;
[2024-05-16 17:23:28,696 INFO] Step 100/20000; acc: 36.5; ppl: 132.3; xent: 4.9; lr: 0.00007; sents:  243673; bsz: 1315/1222/54; 6833/6347 tok/s;   1742 sec;
[2024-05-16 17:37:51,586 INFO] Step 150/20000; acc: 41.7; ppl:  91.7; xent: 4.5; lr: 0.00011; sents:  236869; bsz: 1322/1216/53; 6894/6344 tok/s;   2605 sec;
[2024-05-16 17:52:16,419 INFO] Step 200/20000; acc: 50.2; ppl:  51.8; xent: 3.9; lr: 0.00014; sents:  241859; bsz: 1322/1221/54; 6877/6352 tok/s;   3470 sec;
[2024-05-16 18:06:40,540 INFO] Step 250/20000; acc: 55.9; ppl:  36.9; xent: 3.6; lr: 0.00018; sents:  235698; bsz: 1318/1220/52; 6864/6356 tok/s;   4334 sec;
[2024-05-16 18:21:05,068 INFO] Step 300/20000; acc: 58.4; ppl:  32.0; xent: 3.5; lr: 0.00021; sents:  235065; bsz: 1322/1218/52; 6883/6340 tok/s;   5199 sec;
[2024-05-16 18:35:30,901 INFO] Step 350/20000; acc: 60.6; ppl:  28.3; xent: 3.3; lr: 0.00025; sents:  239441; bsz: 1319/1219/53; 6854/6337 tok/s;   6064 sec;
[2024-05-16 18:49:48,189 INFO] Step 400/20000; acc: 62.0; ppl:  26.1; xent: 3.3; lr: 0.00028; sents:  241219; bsz: 1314/1221/54; 6896/6412 tok/s;   6922 sec;
[2024-05-16 19:03:59,091 INFO] Step 450/20000; acc: 63.1; ppl:  24.4; xent: 3.2; lr: 0.00032; sents:  236453; bsz: 1322/1221/53; 6990/6458 tok/s;   7773 sec;
[2024-05-16 19:18:09,557 INFO] Step 500/20000; acc: 64.0; ppl:  23.0; xent: 3.1; lr: 0.00035; sents:  239313; bsz: 1316/1221/53; 6963/6463 tok/s;   8623 sec;

1 Like

OK, so this is why you also implemented prefixes and special vocab features… a trilingual model.

Regarding training convergence, which learning rate and warmup steps are used in each curve?

Model Apr-26 (black) - basic model with 450M parameters. Trained with noam, warm-up 16000 and LR = 3.
Model May-16 (blue) is a trilingual model based on the base Apr-26, but with a reset optimizer and an updated vocab: noam with a short warm-up of 1000 steps and LR = 0.5. The calculator for the LR scheduler was posted in the next topic:
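
OpenNMT-py’s noam schedule scales the base LR by d_model^-0.5 * min(step^-0.5, step * warmup^-1.5); a minimal sketch of such a calculator, assuming d_model = 512, reproduces the values in the log above (about 0.00035 at step 500 for LR = 0.5 and warm-up 1000).

```python
def noam_lr(step, base_lr, warmup_steps, d_model=512):
    """Effective learning rate under the noam schedule used by OpenNMT-py."""
    step = max(step, 1)
    return base_lr * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# May-16 model: LR = 0.5, warm-up 1000  ->  ~0.00035 at step 500 (matches the log above)
print(round(noam_lr(500, 0.5, 1000), 5))
# Apr-26 model: LR = 3, warm-up 16000   ->  peak of ~0.001 reached at step 16000
print(round(noam_lr(16000, 3, 16000), 5))
```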

This is the same case of short training with transfer learning from a previously pre-trained model.

The GELU-activated model underperforms ReLU on automated metrics, much like the one with modified label smoothing, or with bigger vocabularies. I still have to check why: for instance, I found label_smoothing = 0.11 to yield somewhat sharper wording (sharper than the reference translations too, hence the automated score was not so good…).

I also finished training a model with Shaw20 (RPE) position encoding. Although the training data is the best obtained so far, automated metrics do not show much of an improvement (COMET is up 0.0003, BLEU is either down or at par). I’ll also have to check why for myself.

And I reread the Shazeer article introducing gated activation functions: they actually reduce, not increase, the ff layer size, to keep the parameter count constant across the whole transformer.

We use the same code base, model architecture, and training task as the base model from [Raffel et al., 2019]. The encoder and decoder each consist of 12 layers, with d_model = 768. For the attention layers, h = 12 and d_k = d_v = 64. The FFN layers have hidden size d_ff = 3072. As we describe above, for the GLU-variant-based FFN layers, which have three weight matrices instead of two, we reduce the hidden layer to d_ff = 2048, so as to maintain the same parameter and operation counts as the base model.

Since we use d_model = 512, we cannot simply downsize the ff size to 2/3 of its value, so I just started training with the same ff this morning. And I used “silu” activation, which is actually a “SwiGLU” in OpenNMT-py. According to Shazeer, it should yield >95% of the “gated-gelu” activation’s results, and it spares the trouble of having to tweak dependencies.

All in all, the model with 18 encoder layers and 6 decoder layers has 199M parameters, against 278M if you increase ff to 6144. I’ll check the relation between ff size and quality later; switching fundamental parameters may have altered it quite a bit.

As for tweaking, a PR is due in Locomotive to allow activation and position encoding to be set normally in the config.json file. Before that, one has to check for non-regression when using default values (since we had to uncomment “max_relative_positions” in train.py).

Regarding the number of parameters of GLU functions, maybe I didn’t explain it correctly…
GLU variants of the FF block use 3 weight matrices instead of the 2 used by regular activations like ReLU/GELU. Accordingly, the number of model parameters is equivalent for
gated-gelu(2048) = ReLU(3072).
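
To make the equivalence concrete, here is a quick check with the d_model = 768 from the quoted paper (biases ignored): a ReLU FFN holds two d_model x d_ff matrices, a GLU-variant FFN holds three, so d_ff = 2048 gated matches d_ff = 3072 plain.

```python
def ffn_params(d_model, d_ff, gated=False):
    """Weight-matrix parameters of a transformer FFN block (biases ignored)."""
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff

d_model = 768  # base model from Raffel et al., as in the quote above
assert ffn_params(d_model, 3072) == ffn_params(d_model, 2048, gated=True)  # 4,718,592 each
```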

Regarding the SILU function in OpenNMT-py, the terminology is generally confused.

SILU is just an activation function like GELU, but the gated version (GLU) should not be called SILU; it should be called SwiGLU, or at least gated-silu.

To be precise, what is implemented there is a SwiGLU or gated-silu function. There should be an additional configurable (or learnable) parameter β, but in fact it is not there.
I tried training the model with the default settings for silu in OpenNMT-py and got terrible results.
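
For reference, a minimal PyTorch-style sketch of the distinction (with the β mentioned above fixed at 1, as in plain SiLU): SiLU is just x * sigmoid(x) applied inside the usual two-matrix FFN, while SwiGLU gates a second projection with the SiLU-activated first one, adding a third weight matrix.

```python
import torch
import torch.nn.functional as F

def ffn_silu(x, w1, w2):
    # plain FFN with SiLU activation: silu(x @ W1) @ W2  (two weight matrices)
    return F.silu(x @ w1) @ w2

def ffn_swiglu(x, w1, w3, w2):
    # gated variant: (silu(x @ W1) * (x @ W3)) @ W2  (three weight matrices)
    return (F.silu(x @ w1) * (x @ w3)) @ w2

d_model, d_ff = 512, 4096
x = torch.randn(2, d_model)
w1, w3 = torch.randn(d_model, d_ff), torch.randn(d_model, d_ff)
w2 = torch.randn(d_ff, d_model)
print(ffn_silu(x, w1, w2).shape, ffn_swiglu(x, w1, w3, w2).shape)
```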

Well, I also launched gated-gelu yesterday on my dev instance, only with ff 6144 I ran into memory issues and had to split the batch quite a bit. That prompted me to reread the article, hence my decision this morning not to tweak my main instance right now.

If silu does not deliver, well, I’ll have to adapt. The problem is that gated-gelu is only implemented for seq2seq models in PR #1687; said PR has not been merged yet, let alone published in a stable version.

Which is OK when you do research, but research is not exactly my field…

Yes, PR #1687 will be merged later; more edits are needed there in the C code for full ALiBi support. After this merge I will create a PR to add similar features to Locomotive.

I seem to have missed the point: do you use synthetic back-translated corpora, and to what extent? For me, they seem to have provided the main quality boost, aside from increasing the model size (the two work well together). It was only later that I tried to squeeze out the leftovers through hyperparameter settings and other tricks.

As far as back-translation is concerned, it is a convenient way to augment the training data’s scope and quantity, but I am concerned about the bias it may introduce.

From a qualified source, I have heard that roughly a third of automatically translated documents in the English-German language pair are unsuitable for use. I was told: “Say you add more of that to the training data, by what miracle would it lead to a more useful translation?” and had to argue that there was simply not enough well human-translated data available. But well, they had a point.

So I gave it a try with as much of the latter as possible, augmented with excerpts from machine-translated corpora selected on their capacity to look plausible (as measured by BLEU) at an early stage of training (that is, before the model can develop a bias). Although the dataset is only 25M sentences, it yields a 148M-parameter model scoring a pretty good 0.8896 COMET, before trying geglu and RPE.

From the looks of it, there is little difference between real language and the translations from Google or any of our best models. Except real language is what the natives speak and write, whereas back-translations are “translationese”.

For instance, “Bathed in the afternoon sun, the small town…” ‘translates to’ “Купаясь в дневном солнце, маленький городок…” (literally, “bathing itself in the daytime sun, the small town…”). How do you say? “Ne po-russky” :slight_smile: the small town ends up like your average Joe in a bathing suit, toes dipped into the sun.

I’ve read plenty of such badly translated sentences, but what to do about it? If the s***r does not have the correct expression for a small town bathed in sunlight in its original training data, it has no chance of learning it, nor will it feature in any back-translation, so it’ll take the one for Joe.

There’s no knowing for sure how DeepL gets “Залитый полуденным солнцем” (“flooded with the midday sun”), which is very much po-russky, but a good guess is that they train from scratch on duly paid-for, copyrighted content translated long ago by professionals, not on back-translated data.

Ugh, it seems the back-translation idea doesn’t work that way, and it isn’t just there to increase the amount of data!

To train the EN_DE model, you take original, real sentences in German, very high quality ones.
You translate them into English and use the resulting pairs to train the model to translate back into German.

The point is that the model learns, from noisy and less-than-perfect source data encoded into a semantic vector representation, to generate a translation that is of high quality and as close to real language as possible.

With this technique, we train the model to be robust to noise and teach the decoder to generate the most accurate sequences possible on the original, high-quality sentences of the target language.

There are actually a number of papers on arXiv devoted to this effect. One can easily adapt to the desired domain and get a very impressive quality gain.
It is also possible to do this process iteratively, in both directions at once.

In addition, this technique can be combined with prefix tagging, in which case the decoder is still trained, but the model’s bias is removed and it learns to distinguish synthetic sentences from real ones.

I highly recommend reading these works:

https://www.semanticscholar.org/paper/Iterative-Back-Translation-for-Neural-Machine-Hoang-Koehn/0669f0a031cfaada55841e5962eb6796d4e94971

https://www.semanticscholar.org/paper/Tagged-Back-translation-Revisited%3A-Why-Does-It-Work-Marie-Rubino/d141914b07dee69d8ae0e87da25b4e3bb2b80029
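
Here is a minimal sketch of how a tagged back-translation training set could be assembled for the EN→DE case described above, assuming the back-translations already exist as plain-text files; the file names and the <bt> tag token are illustrative, in the spirit of the tagged-BT papers linked above.

```python
# real.en / real.de : genuine parallel data
# mono.de           : high-quality monolingual German
# mono.bt.en        : mono.de back-translated into English by a DE->EN model
# File names and the <bt> tag are illustrative.

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

src, tgt = read_lines("real.en"), read_lines("real.de")
bt_src = ["<bt> " + line for line in read_lines("mono.bt.en")]  # tag synthetic sources only
bt_tgt = read_lines("mono.de")

with open("train.en", "w", encoding="utf-8") as fs, open("train.de", "w", encoding="utf-8") as ft:
    for s, t in zip(src + bt_src, tgt + bt_tgt):
        fs.write(s + "\n")
        ft.write(t + "\n")
```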

1 Like

By the way, more on how human the written text feels: I think the sampling method used during generation plays a big role here. DeepL handles this well, given the choice of options it offers when translating (whole sentences or individual words).

By default argostranslate uses beam search with a beam size of 4. The text is accurate, but dry and machine-like, which is a known problem.

However, CTranslate2 also supports random sampling with temperature, top-k and top-p settings. Ideally, these would be exposed when requesting a translation via the LibreTranslate API, combined with beam search (configurable, with plain beam search as the default).

On the downside, the result may be less accurate and non-deterministic.
On the plus side, the text is less dry, more human and more pleasant to read.
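
As an illustration, here is roughly what random sampling looks like with the CTranslate2 Python API; the paths, tokenization details and parameter values are illustrative, and how sampling interacts with beam_size may vary across CTranslate2 versions.

```python
import ctranslate2
import sentencepiece as spm

# Paths are illustrative.
translator = ctranslate2.Translator("en_de_ctranslate2/", device="cpu")
sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")

tokens = sp.encode("Bathed in the afternoon sun, the small town was quiet.", out_type=str)

# Random sampling instead of the default beam search.
results = translator.translate_batch(
    [tokens],
    beam_size=1,
    sampling_topk=10,          # sample among the 10 most likely tokens
    sampling_temperature=0.8,  # < 1.0 keeps the output closer to greedy
)
print(sp.decode(results[0].hypotheses[0]))
```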

I am currently experimenting with these methods of generation, but it’s probably a separate topic.

1 Like