Basically, one has to add the following 3 lines to their configuration:
"position_encoding": "False",
"max_relative_positions": -1,
"pos_ffn_activation_fn": "silu",
That said, it’s “RoPE/GLU 101”… Now for 102:
- “silu” in OpenNMT-py and CT2 is actually coded as a GLU function (see the sketch below),
- there is another, slightly more efficient GLU (“gated-gelu”), but it is only built into OpenNMT-py,
- RoPE is also not fully supported in the CT2 converter, so CT2 requires tweaking to make it work (please make sure you do this safely, e.g. use an IDE with a “Deployment” option).
For more details, have a look at this issue and the subsequent pull request.
NB: PR#1687 applies to CT2==4.2.1; rebase it onto 4.3.1 before deployment.
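To make the first bullet concrete, here is a minimal PyTorch sketch of what a GLU-style feed-forward block looks like. This is only an illustration of the variants from Shazeer’s paper, not the actual OpenNMT-py or CT2 code; the class and argument names are mine:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    """Position-wise feed-forward block with a gated linear unit (GLU)."""
    def __init__(self, d_model: int, d_ff: int, activation=F.silu):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # value branch
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # projection back
        self.activation = activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "silu" -> SwiGLU, "gated-gelu" -> GEGLU: only the gate activation differs.
        return self.w_down(self.activation(self.w_gate(x)) * self.w_up(x))

swiglu = GatedFFN(512, 2048, activation=F.silu)  # what "silu" selects here
geglu = GatedFFN(512, 2048, activation=F.gelu)   # what "gated-gelu" selects
```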
Once this is in place, you should be able to run RoPE/GLU. Using “gated-gelu” instead of “silu” yields a few nicer translations, but the two functions are 98+% similar according to N. Shazeer et al., and we have more urgent problems… that’s RoPE/GLU 103.
We had an esoteric debate about it last year; I eventually used “gated-gelu” and modified the CT2 converter because it yielded the best performance most consistently. However, my best de-en checkpoint ever used “silu”, although that was a lucky strike I have not been able to reproduce.
Using RoPE/GLU also completely changes the way training runs because of flash attention, so further modifications are needed: the vanilla schedule is much too fast for flash attention.
- the learning rate (0.15) is too high; it has to be reduced to something much lower (0.05, 0.02, 0.0375… I am still looking for the right value),
- the warmup period is too long: flash attention reaches suboptimal values slightly after 9k steps, and with the default LR the model freezes for good…
- with LR=0.05, the model freezes from 9k to 16k at better though still suboptimal values, then sits in stalled patience between 16k and 18k while the metrics jump back up, and then improves anew…
Since the formula is LR/sqrt(N/ws), reducing the warmup steps requires raising the LR to get the same curve. So common sense dictates adding this to the config:
"learning_rate": 0.0375,
"warmup_steps": 9000,
However, even these values throttle training: the throttling point appears at 8k steps, and trying 0.02 instead gave the same result…
At this point, changing the scheduler appears to be the best option available. @lynxpda used the scheduler invented by N. Shazeer (“noam”): the formula is different, hence a learning curve that looks much like the one produced by the parameters above, except that between 0 and 9k it spikes up from 0 and then plummets after warmup instead of remaining constant at LR/sqrt(ws) (0.00119 in vanilla vs. 0.0004 above):
"decay_method": "noam",
"learning_rate": 0.5,
"warmup_steps": 1000,
This is equivalent to using 0.02 with ‘rsqrt’, except that the short warmup gives training an initial oomph (at step 1000, the effective learning rate rises to 4 times the ‘rsqrt’ plateau). So whereas ‘rsqrt’ with 0.0375 does not converge optimally, noam with 0.5 does.
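To make the comparison concrete, here is a small sketch of the two curves as I read them; the noam formula carries a d_model**-0.5 factor, and d_model = 512 is an assumption on my part (the toolkit’s exact implementation may differ):

```python
import math

def rsqrt_lr(lr, step, warmup_steps):
    # Plateau at lr / sqrt(warmup_steps) during warmup, then decay as lr / sqrt(step).
    return lr / math.sqrt(max(step, warmup_steps))

def noam_lr(lr, step, warmup_steps, d_model=512):
    # Linear ramp up to a peak at `warmup_steps`, then decay as 1 / sqrt(step).
    return lr * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1000, 9000, 16000, 32000):
    print(step,
          round(rsqrt_lr(0.02, step, 16000), 6),   # rsqrt, LR 0.02, default warmup
          round(noam_lr(0.5, step, 1000), 6))      # noam, LR 0.5, 1k warmup
```

Under these assumptions, the noam curve peaks at roughly four times the rsqrt plateau at step 1000, and the two track each other within about 10% once both are past their warmup.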
However, it takes ages to do so, and it takes a very clean dataset for the training not to hit early stopping before that (which is what I train on, but it uses features that incur added runtime and whose license is more restrictive than the current LT license).
So I am looking for parameters that would also work well for the community, and I will edit this post accordingly.
So far, noam decay allows training “BIG” transformers where tokens are encoded as 1024-dimensional vectors, while rsqrt only allows training 768-dimensional vectors, which is theoretically large enough but not the industry standard.
As for the differences between systems (Windows and Linux handle seeds slightly differently, and some torch features also behave differently across the two): while there are manifest differences between the training metrics of two experiments run in parallel, the converged models exhibit the same characteristics at evaluation, within the error margin of COMET metrics.
Finalizing the models with "avg_checkpoints": 3 or 5 (if training ends on an early stop, 5 averages all the checkpoints from the best metric onwards) averages the latest checkpoints and cancels out those training discrepancies.
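For the curious, checkpoint averaging is nothing more than an element-wise mean of the saved weights. A rough sketch, assuming OpenNMT-py-style checkpoints that store their weights under a "model" key (the file names are hypothetical):

```python
import torch

def average_checkpoints(paths):
    """Element-wise mean of the model weights from several checkpoints."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. the last three checkpoints of a run (hypothetical file names):
averaged = average_checkpoints(
    ["model_step_48000.pt", "model_step_49000.pt", "model_step_50000.pt"]
)
```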
After a few trials, the most efficient scheduler seems to be the following (optimum somewhere in the 0.754 to 0.9 range):
"avg_checkpoints": 3,
"decay_method": "noam",
"learning_rate": 0.905,
"warmup_steps": 1000,
After 16000 steps (the default warmup phase), the learning curve will follow the one given by
"learning_rate": 0.04,
with the other parameters at their defaults.
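A quick arithmetic check of that equivalence, under the same assumptions as the schedule sketch above (textbook noam formula, d_model = 512):

```python
import math

# Past its 1000-step warmup, noam decays as lr * d_model**-0.5 / sqrt(step).
print(0.905 * 512 ** -0.5)                     # ~0.040: the equivalent rsqrt LR
print(0.905 * 512 ** -0.5 / math.sqrt(20000))  # noam at step 20k
print(0.04 / math.sqrt(20000))                 # rsqrt with LR 0.04 at step 20k: same value
```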
The only catch is that the extra learning at the beginning of noam decay makes the model converge faster over the first 16k steps, so you end up with roughly the same runtime as with the default schedule, but with a much smoother end of training and a cleaner convergence.