Locomotive features and fixtures

Hello guys.
Over the last year, I developed an array of things around Locomotive, but since last summer I have been under quite a bit of pressure, so I did not take the time to publish these properly.
There’s:

  1. Introduced a “utils” directory (handier than the cache when running lots of utilities),
  2. Allowing translation memories to serve as a dummy flores200 dataset (convert the TMX to Moses format with a separate script if necessary, then name the files as if they were a flores200 dataset and put them where they belong, i.e. utils/flores200…),
  3. Updating the NLLB languages in data: I appended the language names used in OPUS as comments and went a little further (with our Persian teacher, I could tell Persian pes and prs apart, and I switched Tatar to the tt code, since the crt code in the current list stands for Crimean Tatar), but beyond that I'll leave it to the community,
  4. Two useful filters: one uses fasttext (another subdirectory in utils) for third-language filtering (I mostly use it on EU and TED/QED/Wiki corpora), the other limits parasite Latin characters (which proves handy with the HPLT corpus in Arabic and the Japanese NLLB corpus). I had to tweak the process_data method quite a bit to turn the languages specified in config.json into kwargs and feed them to the filter lambda; a sketch of the third-language filter follows this list.
  5. Downloading spaCy xx or Stanza depending on the Stanza response (I could maybe try language-specific spaCy models first, but haven't so far…) and packaging whatever comes out,
  6. Some fixes in byte fallback and ONMT configuration parameters,
  7. A “prepare data only” mode in train.py: it stops the script right before shuffling and, when using more than 10 sources, avoids random memory errors in the multithreaded writer. You can then process the resulting subsets in a second step.
  8. Making the Lynx configuration, which works with any GPU generation, the default (161M parameters; a full training run takes up to 3 days on a Tesla V100S, two days on an RTX 4000, and 20 to 40 hours on an RTX 6000 Ada). It improves performance by an order of magnitude compared to the vanilla configuration.
  9. Including pivoting to another language in the eval (put the unzipped corresponding pivot Argos packages in utils),
  10. Using CometKiwi to assess data quality and scrap the worst half (that's the amount I found worked best, and so did Google, who published on it while I was researching the right proportion :grinning:… well, for some languages it isn't right, but you can figure that out if you know the languages: at the median there should be a majority of sentences with minor errors and fewer than 30% with critical ones),
  11. Using an Argos model to pivot a whole corpus (this can come in handy for Basque, to pivot eu-es into eu-en or vice versa; that's how I generally use it, pivoting xx-en data into xx-fr). Crucially, as long as you assess and cut with CometKiwi afterwards (right now there's a 20-30% quality loss at translation, but following steps 10 to 12 this should get better), and provided at least 20% of the corpora are original and non-transformed, it yields better results than the English pivot. For now it's a script; I have to recode it as a transform.
  12. Recoding the dedup: first, because it introduces an imbalance between datasets when using the reverse option for training, and second, because it is totally random, so you can have bad luck and end up with a model that's up to 25% less accurate only because the shuffling put the wrong alternates first and the dedup picked them for training. After two weeks of dataset research and a fumble (which mistakenly erased a training dataset), I finally got an acceptable ja-fr model that was 30% more accurate than the previous one obtained with the same configuration. So I am coding something that scraps true duplicates, computes CometKiwi scores, and keeps the best/some/any/random alternates as a result. I am trying to optimize resources, but running CometKiwi non-stop in the cloud on a Tesla V100S (5k CUDA cores) for 3 months to assess the UN working-languages corpora made buying an RTX 6000 (18k CUDA cores) a very rational decision.
  13. Adding traceability: keep filtered or deduped data on the side and add JSON Lines metadata files with the relevant information (corpus, line number within the corpus, filters/transforms/augments applied, CometKiwi scores in both directions). That would allow further data science on the datasets.
  14. Automating README citations by parsing this metadata.
  15. And allowing the seed to be fixed in ONMT, for more accurate research on hyperparameters.
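
To give an idea of item 4, here is a minimal sketch of the third-language filtering, assuming fasttext's public lid.176.bin language-identification model; the file path, function name and 0.5 threshold are illustrative assumptions, not the actual Locomotive code.

```python
# Illustrative sketch only, not the actual filter code.
import fasttext

# lid.176.bin is fasttext's public language-identification model,
# assumed here to be stored under utils/fasttext/
model = fasttext.load_model("utils/fasttext/lid.176.bin")

def is_expected_language(line, lang_code, threshold=0.5):
    """Return True when fasttext identifies `line` as `lang_code`
    with at least `threshold` confidence (drops third-language noise)."""
    labels, scores = model.predict(line.replace("\n", " "), k=1)
    predicted = labels[0].replace("__label__", "")
    return predicted == lang_code and scores[0] >= threshold

# The language codes come from config.json and reach the filter as kwargs,
# e.g. something along the lines of:
# keep = lambda src, tgt, **kw: is_expected_language(src, kw["source_lang"])
```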

Please tell me what would work for you (and I’ll PR it) and what won’t (I’ll put it on my own branch).


Wow, that’s quite a few changes! Nice work.

This one seems particularly important, perhaps it should be prioritized.

In general, any contribution that makes the software more useful, accurate, easier, etc. for general purpose training is welcome. (Is it going to be useful for others and is it going to be easy for others to use?)

What’s probably out-of-scope are scripts or logic that were beneficial for a single dataset or goal but don't really generalize to training other languages, or that are part of a workflow that is not easily integrated/documented.

OK, so I'll assess each step against these criteria and make a proposal.
I was actually working on the dedup when I had to prioritize the work on Argos, since January was coming to an end :upside_down_face: and I had said I'd do it before then.
Also, to integrate it fully, you'll have to integrate CometKiwi first, which uses a CC BY-NC-SA license, i.e. non-commercial use only and share-alike. I don't mind too much, because eventually I should open-source all my artifacts, but do you?
fasttext is also under a CC share-alike license, but there I only had to credit the creators in filters.md to comply.

Assessment:
Steps 1 to 7 and 9 are already operational in my version of the legacy code. I need a few days to clean-commit them one by one into the project and make a PR.
1 became quite obvious when I ended up with four permanent directories in the cache.
2 edits fewer than five lines of code and is useful to anyone wanting to assess a specialized model.
3 is useful for anybody looking for data in OPUS and not finding it straightforwardly (the comments include the absence of an NLLB dataset for some languages); the review was extensive.
4 covers two filters: I use fast_lang on nearly all datasets across several corpora, and limit_latin_chars is useful for Japanese, Arabic and Korean at least. Both are already functional in filters/data with a couple dozen lines of code.
5, the spaCy download, is already integrated: if my PR goes into Argos-translate quickly, I may consider amending it (no need for a packaged xx_ud_sent_sm anymore).
6 covers issues I encountered in Chinese, where the byte-fallback tokens should be extensively trained but are not, so I guess it is only about config parameters. We have already discussed it: it's about restoring some ONMT default parameters that may have evolved since you wrote the scripts.
7 is open to discussion: when merging big datasets (more than 50M sentences and 10 sources) the memory error occurred systematically, so I began producing smaller subsets, splitting NLLB into 20M-sentence excerpts, UN into its own subset, and intensively filtered corpora into a “filters” subset, then assembled the subsets later (after cutting them with CometKiwi). It may be OS- or hardware-dependent, though; I'll have to check on my new hardware first.
8 I will most likely keep in the config.json files, and I'll produce a TRANSFORMERS.md file to document it: I forgot that the Lynx architecture requires editing 2 files in CTranslate2 (@lynxpda had made PR#1687, but it has remained in PR hell ever since; you see, this PR lets us beat the owner of CTranslate2 at its own game, so they didn't play ball).
9 is useful for anyone assessing a model they have produced against a third language they have to translate into. It's about 20 lines and the implementation is quite straightforward (unzip the pivot Argos package into the utils directory and there you go; there are two extra arguments, pivot_from and pivot_to, and I can probably add a few lines of code to make it a single argument if you require it).
10 is run in a separate script so far, but for thorough deduplication it needs integration into data or a utils module; a sketch of the scoring follows this list. Then there's the license issue: you cannot download the model without creating a Hugging Face account and accepting the license. Integration should be straightforward, but the bigger issue is how long it takes to assess a dataset: although this yields far more accurate models, calculating these scores in one direction (and in the other, they differ) IS time-consuming; we're talking an hour per 2 million sentences (and then another hour backwards) on a high-end consumer GPU. In this regard, the current random method is much faster; only the dedup should come before shuffling. I'll leave it up to you to decide what's best.
11 also runs in a script so far, so the PR will depend on whether integration as a transform is straightforward or not; caching two related corpora together requires a tricky code edit too. It is useful for regional languages (Basque, Catalan, …) and for whoever wants to train language pairs not involving English with relatively small (<15M sentences) data available. This is also time-consuming, about as much as the previous step.
12 will be prioritized between or before 10 and 11; it may take a while if I encounter RAM issues.
13/14: not straightforward at all, and very niche, so I'll keep them in my fork; one can always write CometKiwi scores to a txt file to select alternates.
15: one line to slip into the various fixes (#6) commit, with the default “auto” value.
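
To illustrate item 10, here is a minimal sketch of CometKiwi scoring with the unbabel-comet package; downloading Unbabel/wmt22-cometkiwi-da requires a Hugging Face account that has accepted the model license, and the batch size, GPU setting and example pairs below are illustrative only.

```python
# Sketch of CometKiwi quality estimation; requires the unbabel-comet package
# and a Hugging Face login that has accepted the model license.
import statistics
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

# (source, translation) pairs; the second one is a deliberate mistranslation
pairs = [
    ("Bonjour tout le monde.", "Hello everyone."),
    ("Le chat dort sur le canapé.", "The stock market closed early today."),
]
data = [{"src": s, "mt": t} for s, t in pairs]

# gpus=1 assumes a CUDA device; this is the slow part (roughly an hour
# per 2 million pairs per direction on a high-end consumer GPU)
prediction = model.predict(data, batch_size=32, gpus=1)

# "Scrap the worst half": keep the pairs scoring at or above the median
median = statistics.median(prediction.scores)
kept = [p for p, score in zip(pairs, prediction.scores) if score >= median]
```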

So I pulled 1 to 6 and 9 into Locomotive today, as well as the necessary edits in train to allow 8 (not yet the TRANSFORMERS.md file, but some comments that already help).

As for spaCy, I rewrote the code to include spaCy single-language packages when they can beat Stanza; that requires a new module, since there are 90 lines of new code to this end (a rough sketch of the underlying fallback logic is below).
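
The actual module is about 90 lines and also handles packaging and the case where a language-specific spaCy model beats Stanza; the sketch below only shows the basic fallback idea, and it assumes Stanza raises an exception when asked to download an unsupported language.

```python
# Rough sketch of the sentence-splitter fallback; not the actual module.
import spacy.cli
import stanza

def get_sentence_splitter(lang_code):
    try:
        # Assumption: stanza.download() fails for unsupported languages
        stanza.download(lang_code, processors="tokenize")
        return ("stanza", lang_code)
    except Exception:
        # Fall back to spaCy's public multilingual sentence model
        spacy.cli.download("xx_sent_ud_sm")
        return ("spacy", "xx_sent_ud_sm")
```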

I'll see what I may or may not include in TRANSFORMERS.md, but most of it can be derived from our discussions with @lynxpda last spring.

Now it comes down to the dedup thing and how to curate datasets with CometKiwi (which intertwine quite a lot in my current process), plus the pivot transform.

@pierotofy: tell me what kind of dedup you'd like (keep all alternates, run CometKiwi for x hours then select alternates, or keep only the first one encountered), and I'll optimize it; a quick sketch of the first and last options is just below.
(If it's the first-encountered option, then shuffling will go after dedup: since NLLB is in decreasing order of LASER scores, that should help somewhat.)
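
For reference, here is a stripped-down sketch of what the first and last options amount to (function names are illustrative, not the actual code): "all alternates" drops only exact (source, target) duplicates, while the legacy-style behaviour keys on the source side alone.

```python
def dedup_all_alternates(pairs):
    """Drop exact (source, target) duplicates; keep every distinct alternate."""
    seen = set()
    for src, tgt in pairs:
        if (src, tgt) not in seen:
            seen.add((src, tgt))
            yield src, tgt

def dedup_first_encountered(pairs):
    """Legacy-style: keep one target per source, whichever comes first."""
    seen = set()
    for src, tgt in pairs:
        if src not in seen:
            seen.add(src)
            yield src, tgt
```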

All alternates seems like the way to go. :+1:

OK, I had already coded this in a branch among other things, so it’s ready.
I tested and debugged it on

  • a very small dataset to begin with (hy-fr, 340k sentences; removed 29k in seconds)
  • a bigger one with many duplicates (MultiUN+UNPC in zh-fr: 31M sentence pairs read, 27M written; total run time was 12 minutes and RAM use maxed out at 5GB, which is less than the input files' size, though not as efficient as I thought it would be)
  • an intermediate one (tt-en, 8.7M sentences; removed 28k, total run time a few minutes)

The duplicate counts are consistent with the overlap between the corpora:

  1. since the original encoding in UNPC renders ’ as &quot; or &apos; (and so on), the dedup ends up keeping both the correct sentence from MultiUN and the typo version from UNPC (at least it does not choose between them randomly, as happened when filtering from zh with the legacy dedup);
  2. regarding the Chinese text, the only differences are the presence or absence of “of” (the 的 character), which is more often implicit than not, so it's OK;
  3. and last but not least, whereas the legacy dedup dumped roughly 60% of NLLB tt-en, this one keeps 98%, so it is definitely useful for low-resource languages.

However, I guess there are many alternates of varying quality within the extra data that has now been saved.

Regarding NLLB, this can be addressed quite simply using the top filter, choosing the percentage by going through NLLB with someone who speaks the language. Please tell me if you want to keep it that way, and I'll commit/push. I might be able to tell you next Wednesday whether it trains better tt-en models (this weekend I've got some other training on the bench).

As for other corpora, and since I use a lot of them, I'll get onto feature 10, i.e. integrating CometKiwi scoring and picking alternates on a “best” or “better” basis (1-2 alternates => keep 1, 3-6 => 2, 7-12 => 3, 13+ => 4); see the sketch below.
The latter nut is tougher to crack, but I already nagged Copilot into giving me a starting basis last month. I also have to check the RAM use: it has to work with at least 100M sentence pairs, and I've got a “sort” operation in the code that might spoil the party…
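
Here is a sketch of that “better” selection rule (the helper names are mine, not the final code): the number of alternates kept grows with the number available, and the highest CometKiwi-scored ones win.

```python
def alternates_to_keep(n):
    """Map the number of alternates for one source to how many to keep:
    1-2 => 1, 3-6 => 2, 7-12 => 3, 13+ => 4."""
    if n <= 2:
        return 1
    if n <= 6:
        return 2
    if n <= 12:
        return 3
    return 4

def pick_best_alternates(alternates):
    """`alternates` is a list of (target, cometkiwi_score) tuples for one
    source sentence; return the top-scored subset per the mapping above."""
    k = alternates_to_keep(len(alternates))
    return sorted(alternates, key=lambda a: a[1], reverse=True)[:k]
```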

Please tell me if you're interested in this feature, considering that for the UN corpora it will take at least 7 hours (and then another 7 in the other direction) to process the data instead of 12 minutes. As far as I am concerned, it's worth the while, but for anyone who just wants a decent model for everyday use, it's overkill.

Please also tell me if you are interested in 7 (the --subset option in train; the code is functional, but I have to swap a few lines now that the dedup can go before it) and 11 (the “pivot” transform, which also takes hours); a sketch of the pivot step follows.
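
For 11, the core of the pivot step is just the argostranslate API; the .argosmodel filename and corpus paths below are placeholders, and as said above the output still needs a CometKiwi pass afterwards.

```python
# Sketch of pivoting one side of a corpus with an installed Argos package.
# The package filename and corpus paths are placeholders.
import argostranslate.package
import argostranslate.translate

argostranslate.package.install_from_path("translate-es_en.argosmodel")

# Turn an eu-es corpus into eu-en by translating the Spanish side line by line.
with open("corpus.eu-es.es", encoding="utf-8") as spanish, \
     open("corpus.eu-en.en", "w", encoding="utf-8") as english:
    for line in spanish:
        english.write(argostranslate.translate.translate(line.strip(), "es", "en") + "\n")
```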


This seems like a big increase in runtime. I think it’s important to keep the speed reasonably fast for casual users (so maybe not in scope for inclusion?)

  1. For 7, I can see the use of a standalone script for splitting large datasets, but I don't think this should be tightly integrated into the training script (so also perhaps not in scope)

Then I'll check how well training runs with this new deduplicator (mostly testing how often keeping all alternates incurs vanishing gradients and whether this can be mitigated with the top and excerpt filters; that may take a few weeks, as I'm traveling from the 15th to the 23rd), and I'll push it with the train.md (better than TRANSFORMERS.md, casual users might not see the point).
If you like, I'll also include a data.md with recommendations for dataset assembly. It will mostly contain remarks about using filters and about which corpora are the most useful. I'll nonetheless mention the use of QE evaluation (CometKiwi and BleurtQE), quoting from the Google pre-print, as well as the rationale for splitting (too many threads running in the writer compete for adjacent memory space and end up using very fragmented memory, eventually crashing) and how to split into subsets with the first script I used to regroup a subset from the training data; a sketch of such a split is below.
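
For what it's worth, the subset split itself boils down to something like the sketch below; the 20M figure is the one quoted earlier for NLLB, and the paths and naming scheme are placeholders.

```python
# Split an aligned corpus (two files, one sentence per line) into fixed-size
# subsets; 20M lines per subset is the size quoted above for NLLB.
SUBSET_SIZE = 20_000_000

def split_corpus(src_path, tgt_path, prefix, subset_size=SUBSET_SIZE):
    out_src = out_tgt = None
    chunk = 0
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        for i, (s, t) in enumerate(zip(fs, ft)):
            if i % subset_size == 0:
                if out_src:
                    out_src.close()
                    out_tgt.close()
                chunk += 1
                out_src = open(f"{prefix}.part{chunk}.src", "w", encoding="utf-8")
                out_tgt = open(f"{prefix}.part{chunk}.tgt", "w", encoding="utf-8")
            out_src.write(s)
            out_tgt.write(t)
    if out_src:
        out_src.close()
        out_tgt.close()
```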
Further features, I’ll keep in a distinct branch that I’ll create on Monday.