The problem of translation quality and its solution

The configuration looks good, but it’s quite difficult to implement (see the post about the en-de model: to enable RoPE on Locomotive, you have to trust your luck a lot :slight_smile:).

People who do not have the knowledge to work around this may prefer the alternative Shaw configuration with max_relative_positions = 20 or 32, which trains more slowly but should give comparable results.
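For reference, here is a minimal sketch of what that alternative could look like, using the standard OpenNMT-py option names (how these get passed through a Locomotive config may differ from this):

```yaml
# Sketch only: Shaw-style relative position encoding instead of RoPE.
# A positive max_relative_positions enables Shaw relative positions;
# 20 or 32 is the window discussed above.
position_encoding: false
max_relative_positions: 20   # or 32
```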

For transformer_ff, widening the feed-forward layer from 2048 to 4096 gave the biggest boost to translation accuracy compared with the default or higher values; I skipped 3072, though, so I still have to check it.
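In config terms this is just the feed-forward width (OpenNMT-py option name, shown here as a sketch):

```yaml
# Wider position-wise feed-forward layer; 2048 is the usual default.
transformer_ff: 4096
```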

The gated activation function ‘silu’ may also do the job that ‘gated-gelu’ does. @lynxpda got bad results with it on en-ru, while my results on de-en are better than with gated-gelu; I do not know why, but it may be related to the language pair (which is why I am now trying to train en-fr with these settings). The Noam Shazeer preprint we base ourselves on reports only a third-decimal difference between them, so I think both are worth trying.
Note that, for the time being, these gated functions only produce a valid model when you tweak the CT2 converter and train with RPE or RoPE.
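If you want to experiment with this, the relevant OpenNMT-py option is the position-wise feed-forward activation. A hedged sketch (the exact accepted values depend on your OpenNMT-py version, and the CT2 converter tweak mentioned above is still needed):

```yaml
# Gated feed-forward activation; try 'gated-gelu' or 'silu' as discussed above.
# Accepted values for pos_ffn_activation_fn depend on the OpenNMT-py version.
pos_ffn_activation_fn: gated-gelu   # or: silu
# Currently this only converts to a valid CTranslate2 model after tweaking
# the CT2 converter, and when training with RPE or RoPE.
```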
