Help Wanted: Improve en-de translation

Two things I have noticed about the German translation with the 1.0 model:

  • some fairly common everyday tokens are missing, such as “Lok” not being translated
  • some agglutinated (compound) words are translated as separate parts

German is actually quite regular (if not simple), both grammatically and semantically.
For the same meaning, one word will be used in one context and a different word in another (in English it unfortunately often comes down to style), so a dataset of good quality and sufficient size should do the trick.

2 Likes

I’ll try to retrain the German models. I want to see what effect can be achieved by tagging mined bitexts (CCMatrix).

2 Likes

Hello,
I began working on it a week ago. I instantiated Locomotive on Windows using a V100 GPU and tested various parameters, and so far my results yield more questions than answers :slight_smile:

  1. What should I maximize, BLEU or accuracy?
    I obtained the highest BLEU (51.0382) on my vanilla run, 5k steps trained on 10M sentences from the EU corpora (DGT, EuroParl, …); accuracy was 52.48 and perplexity 41.65.
    On the other hand, training for 50k steps and adding as much trivia as possible from CCMatrix and OpenSubtitles sank the BLEU to 43.4752, but raised accuracy to 73.8118 and lowered the perplexity to 12.0746.

Other questions follow…

2 Likes

Hello!
Can you attach train settings?

I think the number of training steps should only be discussed in relation to the effective batch size (number of GPUs × accum_count × batch_size).

The first thing to focus on is valid/ppl (lower is better). I think a good indicator would be a valid/ppl below 11 on FLORES200 (the default validation set in Locomotive); the final achievable score depends on both the training dataset and the size of the model.
It’s better not to focus on accuracy.

BLEU is a difficult metric to navigate; in my opinion, valid/ppl correlates very well with subjective quality and does not show such a spread when training different models.
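
For intuition, valid/ppl is the exponential of the average per-token cross-entropy on the validation set, so a lower value means the model is less “surprised” by the reference translations. A tiny illustration in plain Python (made-up numbers, not Locomotive code):

import math

# Hypothetical per-token log-probabilities assigned by a model to the
# reference tokens of one validation sentence.
token_log_probs = [-1.2, -0.4, -2.7, -0.9, -1.6]

cross_entropy = -sum(token_log_probs) / len(token_log_probs)  # mean NLL, in nats
perplexity = math.exp(cross_entropy)
print(f"cross-entropy {cross_entropy:.3f} -> perplexity {perplexity:.2f}")

# A validation perplexity around 11 roughly means the model hesitates,
# on average, between ~11 plausible tokens at each position.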

2 Likes

OK, I am not sure what these abbreviations stand for… the Locomotive logs feature several such values.
For instance, the checkpoint after the first 5k steps of my vanilla model displayed:

valid stats calculation (19.156410932540894 s.)
translation of valid dataset (364.9110949039459 s.)
validation BLEU: 2.6792101151198695
Train perplexity: 119.19
Train accuracy: 37.0487
Sentences processed: 8.99618e+06
Average bsz: 6291/6631/225
Validation perplexity: 41.6536
Validation accuracy: 52.4844
Model is improving ppl: 43.5846 --> 41.6536.
Model is improving acc: 51.5805 --> 52.4844.

Config file follows

{
    "from": {
        "name": "German",
        "code": "de"
    },
    "to": {
        "name": "English",
        "code": "en"
    },
    "version": "1.0.2",
"sources": [
        "file://C:\\Users\\NicoLe\\Nopus",
        "opus://ELRC-417-Swedish_Work_Environ",
		"opus://ELRC-1088-German_Foreign_Offic",
		"opus://ELRC-1089-German_Foreign_Offic",
		"opus://ELRC-1090-German_Foreign_Offic",
		"http://data.argosopentech.com/data-wiktionary-en_de.argosdata",
		"opus://ELRA-W0301",
		"opus://DGT",
		"opus://Europarl",
		"opus://EUbookshop"
    ],
	"save_checkpoint_steps": 250,
    "valid_steps": 250, 
    "train_steps": 5000
1 Like

From what I see:

  1. Validation perplexity: 41.6536 - you should focus on this parameter. The lower the better. Below 12 is already good.
  2. Everything that is not explicitly written in config.json falls back to the default parameters in train.py.
  3. Apparently, you are training on one GPU, which means the effective batch size = 1 × 8 × 8192 = 65536 (in train.py: 'batch_size': 8192, 'accum_count': 8); see the short sketch at the end of this post.
    Personally, in my experience, a larger batch size increases the quality of the model, but with diminishing returns. For Transformer BASE models, I got the maximum quality with an effective batch size of 200k (in your case, you can set 'accum_count': 24).
  4. With an effective batch size of 200k and a large dataset (more than 50M sentence pairs), 70-100k training steps are usually sufficient. At 65k, I would expect about 200k steps.
  5. You can also increase the size of the model itself (if the dataset is large enough and of high quality, this makes sense):

'transformer_ff': 4096 increases the ff layer; judging by published preprints and my own observations, this gives the greatest quality gain relative to the increase in model size.

'enc_layers': 20 increases the number of encoder layers; together with the larger ff layer this gives the greatest gain in quality.

I provided the calculator and parameters in this post:

If you increase the size of the model, do not forget to use ‘batch_size’ and ‘accum_count’ to set the effective batch size so that everything fits in your VRAM.
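
To make the arithmetic concrete, here is a small sketch in plain Python (not Locomotive code; the numbers are the defaults quoted above):

# Effective batch size (in tokens per optimizer update):
#   effective = n_gpus * accum_count * batch_size
n_gpus, batch_size, accum_count = 1, 8192, 8          # current defaults
print(n_gpus * accum_count * batch_size)               # 65536

# To approach ~200k tokens per update on the same GPU, raise accum_count
# rather than batch_size (gradient accumulation costs no extra VRAM):
target = 200_000
accum_for_target = round(target / (n_gpus * batch_size))
print(accum_for_target)                                 # 24
print(n_gpus * accum_for_target * batch_size)           # 196608

# Halving batch_size to fit VRAM is fine as long as accum_count is doubled:
assert 1 * 16 * 4096 == 1 * 8 * 8192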

2 Likes

Also, I followed your attached post: in my current run, I tweaked train.py to implement these parameters

    'enc_layers': 20,
    'dec_layers': 6,
    'heads': 8,
    'hidden_size': 1024,
    'rnn_size': 1024,  # figured it should equal hidden_size... ok?
    'word_vec_size': 1024,  # the same...
    'transformer_ff': 4096,

and it overflowed the VRAM, so I halved “batch_size” to 4096 and doubled “train_steps” to 40k… but from your last post I realize I should have raised ‘accum_count’ instead.

I am also considering writing a “middling” filter for Locomotive, to allow excluding the top and bottom percentiles of a corpus (your issue with CCMatrix).
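
Roughly what I have in mind, as a stand-alone sketch (not Locomotive code): it assumes each sentence pair already carries some quality score (for CCMatrix, for instance, the margin/alignment score it was mined with, where available) and keeps only the middle band:

from statistics import quantiles

def middling(pairs, scores, low_pct=5, high_pct=95):
    """pairs: list of (src, tgt); scores: parallel list of floats.
    Keep only pairs whose score lies between the two percentiles,
    dropping the suspiciously good top slice and the noisy bottom slice."""
    cuts = quantiles(scores, n=100)            # 99 percentile boundaries
    lo, hi = cuts[low_pct - 1], cuts[high_pct - 1]
    return [p for p, s in zip(pairs, scores) if lo <= s <= hi]

# Toy example with made-up scores:
pairs = [("Hallo Welt", "Hello world"),
         ("Die Lok fährt.", "The locomotive runs."),
         ("asdf", "qwerty"),
         ("Guten Morgen", "Good morning")]
scores = [1.12, 1.25, 0.40, 1.18]
print(middling(pairs, scores, low_pct=25, high_pct=75))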

It is better not to touch the train.py file, but simply add changeable parameters to config.json.

And regarding batch size: decreasing it while increasing the number of steps is not equivalent to training with a larger batch size. If you reduce batch_size, it is better to increase accum_count by the same factor; then the settings are equivalent.

Also, with 'enc_layers': 20 it is better to set

'hidden_size': 512,
'rnn_size': 512, #figured it should equal hidden_size... ok?
'word_vec_size': 512, #the same...

Otherwise it will be a REALLY big model :)

In the table, the hidden size for the “DEEP” 159M model is 1024, hence those values.
Does it really not matter?
Or does it matter less?

In which table did you see this? Can you provide a link?
This is a clear mistake.
The hidden_size (d_model) parameter is very important: as far as I understand, it determines the dimension of the vectors and indirectly affects the size of the attention heads.
However, I don’t remember using a deep layered model at this size; it would have turned out too huge.

Perhaps you mean Transformer BIG, which I referred to in earlier posts, but the number of encoder layers there did not exceed 6 (vanilla BIG transformer).

However, it later turned out that the BASE model can be scaled not only in width (ff, d_model) but in depth, which also has its advantages.

At the level of intuition, simplifying the architecture down to a couple of sentences:

  1. The transformer consists of two large blocks - an encoder and a decoder:
    • The encoder processes the input sequence and encodes it into an internal attention-aware multidimensional representation.
    • The decoder turns that multidimensional representation into a sequence of tokens (and then words) and works like ChatGPT, producing the most likely next token given the previous ones and the data from the encoder.
  2. The encoder and decoder each consist of many identical layers stacked on top of each other.
  3. The size of each layer, in turn, is determined by the hidden_size and feedforward parameters (the number of attention heads does not affect the size, but is derived from hidden_size):
    • hidden_size is effectively the dimension of the vectors that encode the tokens/sequence; the larger the dimension, the more accurately the coordinates of words/tokens in the space can be conveyed, so synonyms, meaning, context, etc. can be taken into account better.
    • the feedforward block in each encoder/decoder layer has one hidden layer (of width transformer_ff) and is connected to the output of the attention layer (of width hidden_size). It introduces nonlinearity into encoding/decoding and probably has a partial memory function (I have heard that in ChatGPT it is responsible for memorization).
    • the number of attention heads is simple: the usual choice is heads = hidden_size / 64.
  4. The number of encoder layers can be changed, and the model size grows almost linearly with the number of layers.
  5. It is better not to touch the number of decoder layers; they affect the quality of the model in much the same way as the encoder layers, but inference speed drops considerably.
  6. Increasing feedforward from 2048 to 4096 (or 8192) gives the greatest increase in quality relative to the increase in model size.
  7. Increasing the number of encoder layers from 6 to 20 also significantly increases the quality, with a linear increase in the size of the model.
  8. Increasing hidden_size significantly improves quality, but the model size grows much faster, roughly quadratically (see the rough estimator below).

Accordingly, we have to look for compromises between these parameters depending on the task and the amount of data.
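
To give a feel for how these knobs trade off, here is a rough back-of-the-envelope parameter counter (my own simplification, not OpenNMT code; it ignores biases, layer norms, positional encodings and any weight tying, so real checkpoint sizes will differ):

def transformer_params(hidden_size, transformer_ff, enc_layers, dec_layers,
                       vocab_size=32000):
    d, ff = hidden_size, transformer_ff
    attention = 4 * d * d                    # Q, K, V and output projections
    feedforward = 2 * d * ff                 # the two linear maps of the FF block
    enc_layer = attention + feedforward
    dec_layer = 2 * attention + feedforward  # decoder adds cross-attention
    embeddings = 2 * vocab_size * d          # source + target embeddings
    generator = vocab_size * d               # projection back onto the vocabulary
    return (enc_layers * enc_layer + dec_layers * dec_layer
            + embeddings + generator)

for name, cfg in [
    ("BASE  6 enc x 6 dec, d=512,  ff=2048", (512, 2048, 6, 6)),
    ("DEEP 20 enc x 6 dec, d=512,  ff=4096", (512, 4096, 20, 6)),
    ("DEEP 20 enc x 6 dec, d=1024, ff=4096", (1024, 4096, 20, 6)),
]:
    print(f"{name}: ~{transformer_params(*cfg) / 1e6:.0f}M parameters")

Going from 6 to 20 encoder layers at d=512 adds parameters roughly linearly, while doubling hidden_size to 1024 multiplies every attention and feedforward term by four.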

p.s. I might have oversimplified something.

2 Likes

Maybe this will help: on one GPU I would train with a config.json like this:

{
    "from": {
        "name": "German",
        "code": "de"
    },
    "to": {
        "name": "English",
        "code": "en"
    },
    "version": "1.0.2",
"sources": [
        "file://C:\\Users\\NicoLe\\Nopus",
        "opus://ELRC-417-Swedish_Work_Environ",
		"opus://ELRC-1088-German_Foreign_Offic",
		"opus://ELRC-1089-German_Foreign_Offic",
		"opus://ELRC-1090-German_Foreign_Offic",
		"http://data.argosopentech.com/data-wiktionary-en_de.argosdata",
		"opus://ELRA-W0301",
		"opus://DGT",
		"opus://Europarl",
		"opus://EUbookshop"
    ],
    "batch_size": 4096,
    "accum_count": 25,
    "warmup_steps": 16000,
    "train_steps": 100000,
    "learning_rate": 2,
    "vocab_size": 32000,
    "avg_checkpoints": 8,
    "src_seq_length": 185,
    "tgt_seq_length": 185,
    "enc_layers": 20,
    "dec_layers": 6,
    "heads": 8,
    "hidden_size": 512,
    "word_vec_size": 512,
    "transformer_ff": 4096,
    "save_checkpoint_steps": 1000,
    "valid_steps": 2500,
    "num_workers": 6,
    "valid_batch_size": 64,
    "bucket_size": 32768,
    "early_stopping": 0,
    "dropout": 0.3
}

Oops, I misread the parameters. Training with “hidden_size” = 1024 blew up in flight, with validation ppl going to “nan” after 4k steps, and yielded a 450M model.
Thanks for the explanation, everything is clear now.
I am running a new training with:

	"vocab_size": 32000,
	"save_checkpoint_steps": 500,
    "valid_steps": 500, 
    "train_steps": 20000,
#	"batch_size": 4096,
	"accum_count": 25,
	"enc_layers": 20, 
	"transformer_ff": 4096

NB: if I change the “train_steps” value and then run it again, training resumes from 20000 up to the new value.

Sorry, I didn’t quite understand.
train_steps is the total number of training steps.

Yes, I forgot to say: if you change parameters that affect the size of the model or the learning rate, then the training needs to be restarted from scratch. (Other parameters, such as batch_size, accum_count and train_steps, can be changed during the training process.)

@NicoLe Then I will not train the EN_DE model, so as not to duplicate the work.
I am ready to assist in any questions that arise.

2 Likes

I realized it too.
When running 1.0.4, I lowered the batch size and raised the train_steps value.
Training resumed where it had left off, but after adding a source I had to retrain from scratch.
I will keep you posted about the results.

2 Likes

I trained and tested two models on Locomotive, version 1.0.6, with the same parameters as the DEEP ru_en/en_ru models (vocab 32000, feed_forward 4096, 20 encoder layers, batch size 8192, accum_count 25; each model is 173MB). Only 20k train_steps, but the progression is smooth and DE-EN learns very little in the last 10k steps.
There is still some learning margin on EN_DE though.

Sources (opus)
All the ELRC-German_Foreign_Offic corpora; CCMatrix (top 20%); OpenSubtitles (top 70% by weight); EuroPat (top 70% by weight); DGT; EuroParl; EUbookshop

de_en : ppl 9.1881 BLEU 60.64265 (eval.py)
en_de : ppl 10.2332 BLEU 46.43329

If you need a version with only two digits (1.1): I am currently trying to improve on this using the excerpt filter on CCMatrix.

3 Likes

Don’t worry about the version number, I can change it to “1.9” once you have your final model trained.

The models look good! Here’s some text I ran through them:

English Source Text

In the preface to my translation of the “Iliad” I have given my views as to the main principles by which a translator should be guided, and need not repeat them here, beyond pointing out that the initial liberty of translating poetry into prose involves the continual taking of more or less liberty throughout the translation; for much that is right in poetry is wrong in prose, and the exigencies of readable prose are the first things to be considered in a prose translation. That the reader, however, may see how far I have departed from strict construe, I will print here Messrs. Butcher and Lang’s translation of the sixty lines or so of the “Odyssey.” Their translation runs:

- Butler Translation Preface Homer’s Odyssey Project Gutenberg

German Translation (1.0.6)

Im Vorwort zu meiner Übersetzung der “Ilias” habe ich meine Ansichten zu den Hauptprinzipien gegeben, nach denen ein Übersetzer geführt werden sollte, und sie müssen sie hier nicht wiederholen, außer darauf hinzuweisen, dass die anfängliche Freiheit, Poesie in Prosa zu übersetzen, das ständige Nehmen von mehr oder weniger Freiheit während der Übersetzung beinhaltet; denn vieles, was in Poesie richtig ist, ist in Prosa falsch, und die Notwendigkeiten lesbarer Prosa sind die ersten Dinge, die in einer Prosaübersetzung berücksichtigt werden. Daß der Leser jedoch sehen mag, wie weit ich von der strengen Auslegung abgewichen bin, werde ich hier die Herren drucken. Butcher und Langs Übersetzung der sechzig Zeilen oder so der “Odyssee”. Ihre Übersetzung läuft:

English Back Translation (1.0.6)

In the preface to my translation of the “Iliad”, I have given my views on the main principles by which a translator should be guided, and they do not have to repeat them here, except to point out that the initial freedom to translate poetry into prose involves the constant taking of more or less freedom during translation; For much of what is right in poetry is wrong in prose, and the necessities of legible prose are the first things to be considered in a prose translation. However, that the reader may see how far I have departed from the strict interpretation, I will print the gentlemen here. Butcher and Lang’s translation of the sixty lines or so the “Odyssey”. Your translation is ongoing:

They are pretty good, but if I apply all the tricks from my discussions with lynxpda, they could be even better, so I am giving it a try.

2 Likes

Two weeks later, I haven’t succeeded in improving the model. I spent the last three days running trainings from the raw dataset, and realized there is a pretty fair amount of entropy involved in the final result.
Right now, I am still trying to determine which parameters would let me tell, within a few hours, a dataset that won’t yield a good model from one that will.
I’ll publish a post next week with what I found out.

3 Likes