To sum it all up:
- Relative/rotary positional encoding dramatically improves syntax,
- GLU dramatically reduces the noise in the transformer that leads to incoherent or missing output (both are sketched in code right below this list).
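To make those two bullets concrete, here is a minimal PyTorch sketch, for illustration only: the exact flavours vary between toolkits (rotary conventions differ, and the GLU gate can be sigmoid, GELU or SiLU based), and the names `rotate_half`, `apply_rope` and `GatedFeedForward` are just local helpers, not anything from a specific library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rotate_half(x):
    # Split the last dimension in two halves and swap them with a sign flip:
    # (x1, x2) -> (-x2, x1), the building block of the rotary rotation.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope(q, k, base=10000.0):
    # Rotate each query/key pair by an angle proportional to its absolute
    # position, so attention scores end up depending only on the *relative*
    # distance between tokens. Assumes an even head dimension.
    seq_len, dim = q.shape[-2], q.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=q.device, dtype=torch.float32) / dim))
    positions = torch.arange(seq_len, device=q.device, dtype=torch.float32)
    angles = positions[:, None] * inv_freq[None, :]          # (seq_len, dim/2)
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)    # (seq_len, dim)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin


class GatedFeedForward(nn.Module):
    """GLU-style feed-forward block (SwiGLU flavour): one linear projection
    acts as a gate that can silence parts of the other projection."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

Architecture-wise, that is all it takes: `apply_rope` is called on the per-head queries and keys just before the dot product, and the gated block replaces the plain feed-forward layer.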
Implementing each of these features already yields "T(f) outperforms T(0)" in a comet-compare evaluation; that's how powerful they are.
Now for the tricky part: they also powerfully increase the computational load during training, and remember the chaperone analogy…
At times, the GLU layer simply does not get the point anymore and shuts off or saturates all output from an attention head; the phenomenon then propagates through the layers, which yields "vanishing" gradients.
Several factors are involved in this phenomenon, and the scheduler doesn't help: the vanilla scheduler is inappropriate for the GLU when using RoPE, but a smoother scheduler (see my most recent post) does not let the models converge properly either; I tried it in 2024, to no avail.
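For readers who haven't looked under the hood, by "vanilla scheduler" I take it we mean the inverse-square-root warmup from the original Transformer paper; here is a generic sketch of it (the `warmup_steps` and `d_model` values are placeholders, not the settings from these experiments).

```python
import torch


def noam_lr(step, d_model=512, warmup_steps=8000):
    # "Attention Is All You Need" schedule: linear warmup to a sharp peak,
    # then inverse-square-root decay.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Dummy parameter just to keep the example self-contained; in practice you
# would pass your model's parameters. With base lr=1.0, LambdaLR uses the
# value returned by noam_lr directly as the learning rate.
optimizer = torch.optim.Adam([torch.zeros(1, requires_grad=True)], lr=1.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
```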
To avoid these failures, the attention mechanism has to be amended so as not to solicit all neural units at once, permanently. You surely know about the "dropout" mechanism used during training; well, that's a balancing act: instead of guessing (and counter-guessing through the GLU) endlessly, the training process gives the neural units occasional time to rest.
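As a bare-bones illustration of where that "rest" happens, here is dropout applied to the attention weights; in a real toolkit this is just a configuration knob (an attention-dropout probability), the function below only exists to show the mechanism.

```python
import torch
import torch.nn.functional as F


def attention_with_dropout(q, k, v, p_drop=0.1, training=True):
    # Standard scaled dot-product attention...
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)
    # ...with dropout on the attention weights: during training, a random
    # subset of connections is zeroed out on each step, so no unit is
    # solicited permanently.
    weights = F.dropout(weights, p=p_drop, training=training)
    return weights @ v
```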
This has been implemented by Prof. Tri Dao at Princeton under the canny name "FlashAttention", which also accurately describes the hardware-resource optimization the algorithm introduces.
Now, this feature is even more powerful than the former two: with FlashAttention you can complete (yes, really complete) training in a day or so on a gaming card.
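You don't necessarily need the original CUDA kernels to try this: recent PyTorch versions (2.0+) expose fused, memory-efficient attention, including a FlashAttention backend on supported GPUs, behind a single call. A generic usage sketch, not the exact configuration behind the numbers above:

```python
import torch
import torch.nn.functional as F

# Shapes are (batch, heads, seq_len, head_dim); half precision on a CUDA
# GPU is what lets a fused FlashAttention-style kernel be selected
# (adjust device/dtype to your hardware).
q = torch.randn(4, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(4, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(4, 8, 1024, 64, device="cuda", dtype=torch.float16)

# The full seq_len x seq_len attention matrix is never materialised;
# set dropout_p to 0.0 at inference time.
out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.1, is_causal=True)
```

The memory saving is precisely what lets long contexts and decent batch sizes fit on a single consumer card.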
Look at the Tensorboard screenshot @lynxpda posted in May '24 while discussing [our main topic](Help Wanted: Improve en-de translation). Whilst training high-end models may be more complex, this reaches the 80/20 mark.
Therefore, all state-of-the-art models use at least one of these features, if not both, which explains at least part of their accuracy and of their ability to process huge prompts and contexts.
Of course, there are also other features involved, namely
- the number of attention heads in the model (attention is all you need, the more the better), and
- consummate curation of the datasets used for training (garbage in, garbage out).
Actually, if it were a walk in the park, I wouldn't be describing it in such detail.