Locomotive: using rotary positional encoding and gated convolutional layers (on Windows)

Last year we (@lynxpda and I) did parallel research on Russian-English (@lynxpda) and German-English (my bad).

@lynxpda ended up using rotary positional encoding and gated convolutional layers with amazing results: on par with Google Translate.

Since I did most of my career on Windows Server as an MCSE sysadmin, I was leasing a W2022 server with an “old” Tesla GPU and had my reasons to stick to it, so I pursued only the second feature (gated convolutional layers), using relative positional encoding instead of rotary.

Still, I’ve been able to consistently outperform commercially available on-premise alternatives.

Of the Argos models I produced (32 final prototypes to date, 28 deployable, 20 deployed, 4 using spacy), 26 pivot around French instead of English, and all feature a non-commercial license. Since I cannot publish them freely, might as well share the method.

So now, I’ll dive into the rationale for using these features, why they are a feat to implement (assuming you run Locomotive on a Windows workstation like me), and how to optimize them right. This’ll take a few posts, and days, because the last part is still a work in progress.

Take it like a miniseries.


Positional encoding comes into play when you encode/decode a sentence:

  • vanilla models have an “absolute”, or “true”, positional encoding, i.e. the tokens’ positions in a sentence are treated as “absolute” coordinates. Problem is, this “absolute” encoding is compressed in a way that makes it irrelevant when two tokens are more than a half-dozen positions apart.

German verbs have lots of separable prefixes that land at the sentence’s very end… “True”/vanilla positional encoding makes for generally “funny” translations.

  • hence, Google (Peter Shaw, to quote a name) developed, early in the transformer era, “relative positional encoding”, where each token knows its neighbours on a scale that ranges over [-N; N], with N usually 20 or 32, a.k.a. “Shaw N encoding”.

This encoding yields much better syntax than the vanilla one, and not only for German: subject-object-verb languages (Turkish, Farsi, Hindi) are translated with very good syntax from and to French (fr-tr reaches COMET scores similar to fr-it or fr-pt, in spite of the linguistic distance).

However, some languages (Chinese, Japanese) feature extremely strict syntax for sentences to even make sense, and tokenize expensively: seldom-used Chinese characters or kanji encode as 3 tokens each, so welcome back to square one.

  • for this reason, Su et al. invented “rotary” positional encoding, a.k.a. RoPE: positions are represented as an angle on a circle, with a complete revolution every 10k tokens (that’s like a thousand in Chinese).

At common sentence lengths (some dozens of tokens), RoPE is equivalent to Shaw: the angles (very small, on the order of N·π/10000) are almost proportional to the relative positions.
The difference arises with very long sentences, large prompts or one-shot paragraph translation.
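
To make the two flavours concrete, here is a minimal numpy sketch (mine, for illustration only, not the opennmt-py implementation) of what each scheme actually computes:

    import numpy as np

    def shaw_relative_positions(seq_len, N=20):
        """Shaw-style relative positions: pairwise distances clipped to [-N, N]."""
        pos = np.arange(seq_len)
        rel = pos[None, :] - pos[:, None]   # distance between every pair of tokens
        return np.clip(rel, -N, N)          # anything farther apart than N looks identical

    def rope_angles(seq_len, dim=512, base=10000.0):
        """RoPE: one rotation angle per position and per 2-dim pair of the embedding."""
        pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
        inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # one frequency per pair of dimensions
        return pos * inv_freq                             # (seq_len, dim/2) angles in radians

    print(shaw_relative_positions(6, N=2))   # distances beyond +/-2 are all clipped to the same value
    print(rope_angles(6, dim=8)[:, 0])       # fastest-rotating pair: the angle simply equals the position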

Now “gated convolutional layers” (also known as “gated linear units” or GLU) are a trickier concept.

In a transformer layer, there are “attention heads” and a “feed forward”; the nicknames are common sense:

  • an attention head will query a part of the vector representing the currently processed token, called the “key”, and return a “value” (roughly, v = q·k),
  • then the value is stored (fed) into the “feed forward”, which thus has a memorization function in the attention mechanism,
  • said feed-forward passes v (with a normalization that drops infinitesimal variations) to the next layer for querying, and so on until it runs through the whole gizmo.
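
For reference, the textbook computation those bullets paraphrase looks roughly like this (a bare numpy sketch of one attention head, no masking, no multi-head plumbing):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(q, k, v):
        """Scaled dot-product attention: softmax(q.k^T / sqrt(d)) . v"""
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d)   # how strongly each query matches each key
        return softmax(scores) @ v      # weighted mix of values, handed to the feed-forward

    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(3, 4)) for _ in range(3))   # 3 tokens, 4-dim head
    print(attention(q, k, v).shape)                          # (3, 4)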

Whilst output and input are permanently assessed against one another during training, during inference there is nothing like a stopgap on the values running through the transformer, which opens a highway for, you name it, confabulations.

To prevent aberrant values from propagating, Noam Shazeer et al. thought of using a gated layer on top of the feed-forward (FF). As input, it takes the “value” from the FF and the “query” from the attention layer, and, like a chaperone, checks whether they are “presentable”.

One can also picture the mechanism as a cross-examination; the maths actually work in a similar way.
But the chaperone image better explains some of the phenomena occurring thereafter.

Since it combines elements from both layers below, this feature introduces convolutional features into the model, though without the feedback loops that prevailed before the transformer.
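
A minimal PyTorch sketch of such a gated feed-forward block, in the spirit of Shazeer’s GLU variants (my illustration, not the exact opennmt-py code):

    import torch
    import torch.nn as nn

    class GatedFeedForward(nn.Module):
        """Feed-forward block where a second projection 'gates' the first one."""
        def __init__(self, d_model=512, d_ff=2048):
            super().__init__()
            self.w_in = nn.Linear(d_model, d_ff)    # the usual feed-forward projection
            self.w_gate = nn.Linear(d_model, d_ff)  # the "chaperone": decides what passes
            self.w_out = nn.Linear(d_ff, d_model)

        def forward(self, x):
            gate = torch.nn.functional.silu(self.w_gate(x))  # "silu" used as the gating activation
            return self.w_out(self.w_in(x) * gate)           # values the gate dislikes are damped

    x = torch.randn(2, 10, 512)           # (batch, tokens, d_model)
    print(GatedFeedForward()(x).shape)    # torch.Size([2, 10, 512])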

To sum it all up:

  1. Relative/rotary positional encoding dramatically improves syntax,
  2. GLU dramatically reduces the noise in the transformer that leads to incoherent or missing output.

Implementing either of these features yields “T(f) outperforms T(0)” on a comet-compare measurement; that’s how powerful they are.

Now for the tricky part: they also powerfully increase the complexity of calculations during training, and remember the chaperone analogy…

At times, the GLU layer simply does not get the point anymore and shuts off or saturates all output from an attention head; the phenomenon then propagates through the layers, which leads to “vanishing” gradients.
Several factors are involved in this phenomenon, and the scheduler doesn’t help: the vanilla scheduler is inappropriate for the GLU when using RoPE, but a smoother scheduler (see the most recent post) does not allow the models to converge properly either; I tried it in 2024 to no avail.

To avoid this, the attention mechanism has to be amended so as not to solicit all neural units at once, permanently. You surely know about the “dropout” mechanism used while training; well, this is a balancing act: instead of guessing (and counter-guessing via the GLU) endlessly, the training process gives the neural units occasional time to rest.

This has been implemented by Prof. Tri Dao at Princeton under the canny name “flash attention”, which also accurately describes the hardware-resource optimization the algorithm introduces.

Now, this feature is even more powerful than the former two: using flash-attention allows you to complete (yes, really complete) training in a day or so on a gamer’s card.

Look at the Tensorboard screenshot @lynxpda posted in May '24 while discussing our main topic (Help Wanted: Improve en-de translation). Whilst training high-end models may be more complex, this reaches the 80/20 border.

Therefore, all state-of-the-art models use at least one of these features, often both, which explains at least part of their accuracy and of their ability to process huge prompts and contexts.

Of course, there are also other features involved, namely

  • the number of attention heads in the model (attention is all you need, the more the better), and
  • consummate curation of the datasets used for training (garbage in, garbage out).

Actually, if it was a walk in the park, I wouldn’t be depicting it in such detail.

Running it is great; getting it running is a nightmare.

For starters, you need to study the system in detail, starting with your platform:

  1. How many physical processors does your machine feature (not just how many CPU cores)?

Installing flash-attention takes hours unless you use “ninja”, but ninja uses all available cores of the processor that runs it: if your platform is not multiprocessor, it freezes.

  2. What generation is your GPU?

flash-attn only supports Hopper/Ampere/Ada (Hopper systematically, the other two in some versions only). After three weeks last year, I had a seemingly working install that was not doing the job for this precise reason.
Now that I’ve had enough time to try on Ada chips, I realized the latest version does not really support Ada, even on Linux.
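
A quick way to check which generation you actually have (it relies on torch, which you will install anyway; compute capability 8.0/8.6 is Ampere, 8.9 is Ada, 9.0 is Hopper, anything below is out for flash-attn 2):

    import torch

    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(torch.cuda.get_device_name(0), f"- compute capability {major}.{minor}")
    else:
        print("No CUDA device visible to PyTorch")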

  3. What OS do you use?

The main page says: Linux, and Windows from version 2.3.2. Linux, well, is supported; but for Windows, compiling a version depends on so many factors that despair is allowed…

  4. What C compiler, IDE, SDK?

Only what I really had to check to get Argos/LT and Locomotive running on CUDA, plus PyCharm, Ollama to run a local coding assistant, OnlyOffice and Notepad++

  5. What is your PyTorch version?

There, things get counterintuitive: the first attempt I made was installing the stack top to bottom from the versions @lynxpda and I had used…
    user warning: PyTorch has not been compiled with flash attention...
Then I went through all the release docs for CTranslate2, opennmt-py, flash-attention… and it finally struck me: you need to install from bottom to top, because flash-attention has been compiled only for certain, definite PyTorch versions.

The functional torch version for Locomotive, with flash-attention compiled, is 2.1.x+cu121.
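
Before digging further, it is worth checking what your torch build reports; a quick sanity check (torch 2.x API; note the flag below only tells you whether torch is allowed to pick its flash kernel, the “not been compiled with flash attention” warning only shows up once attention actually runs):

    import torch

    # expect something like "2.1.2+cu121" / "12.1" for the stack described below
    print(torch.__version__, "| CUDA", torch.version.cuda, "| GPU visible:", torch.cuda.is_available())
    print("flash SDP kernel enabled:", torch.backends.cuda.flash_sdp_enabled())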

  6. Installing the final dependencies is also tricky:

CTranslate2 has supported flash-attention since 4.2.0, abandoned flash-attention support from Python in 4.4.0, and, starting with 4.0.0, no longer supports CUDA 11… I settled on 4.3.1 for the fixes (last year in June, @lynxpda used 4.2.1, but a better compilation is always welcome).

The most compatible version of opennmt-py that follows is the last one produced, 3.5.1.
Theoretically, we could use 3.4.3, at least we did last year with ctranslate2 4.2.1… no can do anymore :-1: 4.2.1 bugs badly at launch, and the traceback is sibylline enough to drop the hat and try the latest version, successfully proofed last year and this year too.
NB: At least opennmt-py was still maintained at that time; we’ve got to do something about it, read the further posts.

And flash-attn?

Cherry on the cake: the “recommended” 2.3.2 was compiled with torch 2.1 and CUDA 11.8 (but CT2 needs CUDA 12, or does not support flash)…
No other build was known for Windows; issues on GitHub, Stack Overflow and Reddit are full of desperate people trying voodoo-style compilations of the dependency to make it work on Windows.
Oh my!

So we need to

  1. support CUDA 12 (that’s “fixed” from 2.4.3.post1),
  2. compile for torch 2.1 and not torch 2.3 (for opennmt-py’s sake), thus a release before 2.5.1.post1.
  3. Needless to say, it took me some patience and countenance from release 2.5.1 down to 2.4.0.post1, finally installed successfully in about 3 hours total (1h40 for said version on 14 cores & 32GB RAM).
    Shortening installation time requires installing “ninja” beforehand; do not try it on a single-processor machine.

The full dependency stack follows:

pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install ctranslate2==4.3.1
pip install opennmt-py==3.5.1
pip install flash-attn==2.4.0.post1 --no-build-isolation
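
Once this is installed, a short smoke test tells you whether the flash-attn wheel actually matches your torch/CUDA combination (flash-attn 2 only accepts half-precision tensors on the GPU):

    import torch
    from flash_attn import flash_attn_func   # an ImportError here means the wheel and torch don't match

    # flash-attn expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on the GPU
    q = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
    k, v = torch.randn_like(q), torch.randn_like(q)

    out = flash_attn_func(q, k, v, causal=False)
    print("flash-attn OK:", out.shape)        # torch.Size([1, 128, 8, 64])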

All’s well that ends well, everything fell into place… Now, how to run it?

Basically, one has to add the following 3 lines to their configuration:

    "position_encoding": "False",
    "max_relative_positions": -1,
    "pos_ffn_activation_fn": "silu",

That said, it’s “RoPE/GLU 101”… Now for 102:

  1. “silu” in opennmt-py and CT2 is actually coded as a GLU function,
  2. there’s another, slightly more efficient GLU (“gated-gelu”), but it’s compiled in opennmt only,
  3. RoPE is also not wholly supported in the CT2 converter, so CT2 requires tweaking to make it work (please make sure you do this safely; use an IDE with a “Deployment” option).

For more details, have a look at this issue and the subsequent pull request.
NB: PR#1687 applied to CT2==4.2.1; rebase it on 4.3.1 before deployment.

Once this is in place, you should be able to run RoPE/GLU. Using “gated-gelu” instead of “silu” yields a few nicer translations, but the two functions are 98+% similar according to N. Shazeer et al., and we’ve got more urgent problems… that’s RoPE/GLU 103.

We had an esoteric debate about it last year; I eventually used “gated-gelu” and modified the CT2 converter because it yielded the best performance most consistently. However, my best de-en checkpoint ever used “silu”, although it’s been a non-reproducible lucky strike.

Using RoPE/GLU will also completely change the way training runs, because of flash-attention, so further modifications are needed: vanilla scheduling is much too fast for flash-attention.

  1. the learning rate (0.15) is too high; it has to be reduced to something much lower (0.05, 0.02, 0.0375; I’m still looking for the right value),
  2. the warmup period is too long: flash-attention reaches suboptimal values slightly after 9k steps, and with the basic LR, the model freezes for good…
  3. with LR=0.05, the model freezes from 9k until 16k with better though suboptimal values, then goes into stalled patience between 16k and 18k while metrics jump back, and improves anew…
    Since the formula is LR/sqrt(N/ws), reducing the warmup steps requires raising the LR to get the same curve. So common sense dictates adding this to the config:
    "learning_rate": 0.0375,
    "warmup_steps": 9000,

However, even these values throttle training: the throttling point appears at 8k steps, so I tried 0.02, with the same result…
At this point, changing the scheduler appears to be the best option available. @lynxpda used the scheduler invented by N. Shazeer: the formula is different, hence a learning curve that looks much like the one from the above params, except that between 0 and 9k it spikes up from 0 and then plummets after warmup, instead of remaining constant at LR/sqrt(ws) (0.00119 in vanilla vs. 0.0004 supra).

    "decay_method": "noam",
    "learning_rate": 0.5,
    "warmup_steps": 1000,

This is equivalent to using 0.02 on ‘rsqrt’, except the short warmup period gives training an initial oomph (at step 1000, the effective learning rate rises to 4 times the ‘rsqrt’ plateau). So whereas ‘rsqrt’ 0.0375 does not converge optimally, 0.5 does.
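
To visualise the difference, here is a small sketch of the effective learning rates, using the ‘rsqrt’ and ‘noam’ formulas as I read them in opennmt-py (lr/sqrt(max(step, warmup)) and lr·d_model^-0.5·min(step^-0.5, step·warmup^-1.5) respectively; d_model is set to 1024 here as an assumption):

    import math

    def rsqrt_rate(step, lr, warmup):
        """'rsqrt' decay: flat at lr/sqrt(warmup) during warmup, then decays as lr/sqrt(step)."""
        return lr / math.sqrt(max(step, warmup))

    def noam_rate(step, lr, warmup, d_model=1024):
        """'noam' decay: linear ramp up to a peak at `warmup` steps, then 1/sqrt(step) decay."""
        return lr * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

    for step in (1000, 9000, 16000, 50000):
        print(f"step {step:>6}"
              f" | vanilla rsqrt (0.15/16000): {rsqrt_rate(step, 0.15, 16000):.5f}"
              f" | tweaked rsqrt (0.0375/9000): {rsqrt_rate(step, 0.0375, 9000):.5f}"
              f" | noam (0.5/1000): {noam_rate(step, 0.5, 1000):.5f}")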

However, it takes ages to do so, and it takes a very neat dataset for training not to go into early stopping before that (which is what I train on, but using features that incur added runtime and whose license is more restrictive than the current LT license).

So, I am looking for the parameters that would also work well for the community, and will edit this post accordingly.

So far, noam decay allows training “BIG” transformers where tokens are encoded as 1024-dim vectors, while rsqrt only allows training 768-dim vectors, theoretically large enough but not the industry standard.

As for the differences between systems (Windows and Linux handle seeds slightly differently, and some torch features also behave differently on the two systems): while there are manifest differences between the training metrics of two experiments run in parallel, the converged models exhibit the same characteristics at evaluation (within the error margin of COMET metrics).

Finalizing the models with "avg_checkpoints": 3 or 5 (if training ends on an early stopping, 5 averages all the checkpoints from the best metrics on) averages the latest checkpoints and cancels out those training discrepancies.

After a few trials, the most efficient scheduler (optimum in the 0.754 to 0.9 range) seems to be:

    "avg_checkpoints": 3,
    "decay_method": "noam",
    "learning_rate": 0.905,
    "warmup_steps": 1000,

After 16000 steps (the default warmup phase), the learning curve will follow

    "learning_rate": 0.04,

with the other parameters left at default.
Only, the extra learning at the beginning of the noam decay makes the model converge faster in the first 16k steps, so one ends up with roughly the same runtime as with the default schedule, with a much smoother end of training and a cleaner convergence.

Something that puzzled me throughout last year was that some languages seemed to perform way better than others on the very architecture that outperformed commercial alternatives. Having tested several dozen hyper-parameter combinations on de-en and the best ten on en-fr, I was at a loss to find a sensible explanation: there is a host of possible reasons, the most obvious one being overall dataset quality.

This does not hold up to experiment, though: namely on fa-fr, from a small dataset of dubious quality (or so said our Farsi expert), I pulled out (last December) a model that outperformed even Google Translate.

Then, in late March, I had a lucky result I could not reproduce after more than ten trials on en-fr. Given the number of users, I set up an experiment to crack this (training 12 random models on three unrelated languages, then 18 with fixed seeds and different tokenizers, over the course of 2 months).
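
To give an idea of what such an experiment looks like, here is a minimal sketch of training two sentencepiece tokenizers from different random samples of the same corpus (file names and sizes are made up for illustration; Locomotive normally drives sentencepiece for you):

    import random
    import sentencepiece as spm

    # hypothetical corpus file, one sentence per line
    with open("corpus.en-fr.txt", encoding="utf-8") as f:
        lines = f.readlines()

    for seed in (1, 2):
        random.seed(seed)
        sample = random.sample(lines, k=min(1_000_000, len(lines)))  # a different sample per seed
        with open(f"sample{seed}.txt", "w", encoding="utf-8") as out:
            out.writelines(sample)
        # two "equivalent" tokenizers that will nonetheless segment rare words differently
        spm.SentencePieceTrainer.train(
            input=f"sample{seed}.txt",
            model_prefix=f"sp_seed{seed}",
            vocab_size=32000,
            character_coverage=0.9995,
        )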

At first, I thought data preprocessing (namely, the validation set selection) was involved, so I rewrote the shuffler/sampler to trace validation-data relevance across trainings. Eventually, I came to the conclusion that it influences how the model converges, not where (an exception being when val. data and train. data are intentionally unaligned: training then converges too fast).

These experiments eventually highlighted that the tokenizer’s (sentencepiece) content has a huge effect on model performance. Namely, a “good” tokenizer will yield a model that’s an order of magnitude better (we’re talking 1.5+ COMET points, and BLEUs 50+% higher) than one featuring a “bad” tokenizer.

Since this dwarfs most of the improvement that tweaking transformer hyper-parameters can offer, I am currently researching the topic, and coming ex-post (after 3 weeks training en-fr/fr-en alternately) to some conclusions.

  1. rotary position encoding (max_relative_positions = -1) or relative position encoding (max_relative_positions = 16, 20 or 32) makes much better use of the embeddings,
    "position_encoding": "False", #instead of "True" for default
    "max_relative_positions": -1/16/20/32, #instead of 0 for default

On fr-en, from a thoroughly curated 90M-sentence dataset, all other transformer parameters being equal, models featuring “True” (absolute) encoding never reach COMET-22 scores over 0.88, while those featuring “relative” position encoding mostly score above 0.89, and those using “rotary” position encoding always do.

  2. models featuring rotary encoding show much more consistent performance than those using relative encoding across a range of tokenizers trained from different samples of the dataset,
  3. nevertheless, for some tokenizers, relative position encoding outperforms rotary position encoding.

I’ll finish this post if I succeed in determining ex-ante the most promising tokenizer, without having to train models via a dichotomy algorithm…