Locomotive features and fixtures

Hello guys.
Over the last year, I developed an array of things around Locomotive, but since last summer I have been under quite a bit of pressure, so I did not take the time to publish these properly.
There’s:

  1. Introduced a “utils” directory (handier than the cache when running lots of utilities),
  2. Allowing translation memories to serve as a dummy flores200 dataset (convert the TMX to Moses format with a separate script if necessary, then name the files as if they were a flores200 dataset and put them where they belong, i.e. utils/flores200…),
  3. Updating the NLLB languages in data: I appended the language names used in OPUS as comments and went a little further (with our Persian teacher, I could tell Persian pes and prs apart, and I switched Tatar to the tt code, since the crt code in the current list stands for Crimean Tatar), but beyond that I'll leave it to the community,
  4. Two useful filters: one uses fasttext (another subdirectory in utils) for third-language filtering (I mostly use it on EU and TED/QED/Wiki corpora), the other limits parasite Latin characters (which proves handy with the HPLT corpus in Arabic and the Japanese NLLB corpus). I had to tweak the process_data method quite a bit to turn the languages specified in config.json into kwargs and feed them to the filter lambda; a sketch of the third-language filter follows this list.
  5. Downloading spaCy xx or Stanza depending on the Stanza response (I could maybe try language-specific spaCy models first, but haven't so far…) and packaging whatever comes out,
  6. Some fixes in byte fallback and ONMT configuration parameters,
  7. A “prepare data only” mode in train.py: it stops the script right before shuffling and, when using more than 10 sources, avoids random memory errors in the multithreaded writer. You can then process the resulting subsets in a second step.
  8. Making the Lynx configuration, which works with any GPU generation, the default (161M parameters; a full training run takes up to 3 days on a Tesla V100S, two days on an RTX 4000, and 20 to 40 hours on an RTX 6000 Ada). It improves performance by an order of magnitude compared to the vanilla configuration.
  9. Including pivoting to another language in the eval (put the unzipped corresponding pivot Argos packages in utils),
  10. Using CometKiwi to assess data quality and scrap the worst half (that's the amount I found worked best, and so did Google, who published on it while I was researching the right proportion :grinning:… well, for some languages it isn't right, but you can figure that out if you know the languages: at the median there should be a majority of sentences with minor errors and fewer than 30% with critical ones),
  11. Using an Argos model to pivot a whole corpus (this can come in handy for Basque, to pivot eu-es into eu-en or vice versa; that's how I generally use it, pivoting xx-en data into xx-fr). Crucially, as long as you assess and cut with CometKiwi afterwards (right now there's a 20-30% quality loss at translation, but following steps 10 to 12 this should get better), and provided at least 20% of the corpora are original and non-transformed, it yields better results than the English pivot. For now it's a script; I have to recode it as a transform.
  12. Recoding the dedup: first, because it introduces an imbalance between datasets when using the reverse option for training, and second, because it is totally random, so you can have bad luck and end up with a model that's up to 25% less accurate only because the shuffling put the wrong alternates first and the dedup picked them for training. After two weeks of dataset research and a fumble (which mistakenly erased a training dataset), I finally got an acceptable ja-fr model that was 30% more accurate than the previous one obtained with the same configuration. So I am coding something that scraps true duplicates, computes CometKiwi scores, and keeps the best/some/any/random alternates as a result. I am trying to optimize resources, but running CometKiwi non-stop in the cloud on a Tesla V100S (5k CUDA cores) for 3 months to assess the UN working-languages corpora made buying an RTX 6000 (18k CUDA cores) a very rational decision.
  13. Adding traceability: keep filtered or deduped data on the side and add JSON Lines metadata files with the relevant information (corpus, line number within the corpus, filters/transforms/augments applied, CometKiwi scores in both directions). That would allow further data science on the datasets.
  14. Automating README citations by parsing this metadata.
  15. And allowing the seed to be fixed in ONMT, for more accurate research on hyperparameters.
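
To give an idea of item 4, here is a minimal sketch of the third-language filtering, assuming fasttext's public lid.176.bin language-identification model; the file path, function name and 0.5 threshold are illustrative assumptions, not the actual Locomotive code.

```python
# Illustrative sketch only, not the actual filter code.
import fasttext

# lid.176.bin is fasttext's public language-identification model,
# assumed here to be stored under utils/fasttext/
model = fasttext.load_model("utils/fasttext/lid.176.bin")

def is_expected_language(line, lang_code, threshold=0.5):
    """Return True when fasttext identifies `line` as `lang_code`
    with at least `threshold` confidence (drops third-language noise)."""
    labels, scores = model.predict(line.replace("\n", " "), k=1)
    predicted = labels[0].replace("__label__", "")
    return predicted == lang_code and scores[0] >= threshold

# The language codes come from config.json and reach the filter as kwargs,
# e.g. something along the lines of:
# keep = lambda src, tgt, **kw: is_expected_language(src, kw["source_lang"])
```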

Please tell me what would work for you (and I’ll PR it) and what won’t (I’ll put it on my own branch).


Wow, that’s quite a few changes! Nice work.

This one seems particularly important, perhaps it should be prioritized.

In general, any contribution that makes the software more useful, accurate, easier, etc. for general purpose training is welcome. (Is it going to be useful for others and is it going to be easy for others to use?)

What’s probably out-of-scope are scripts or logic that were beneficial for a single dataset or goal but don't really generalize to training other languages, or that are part of a workflow that is not easily integrated/documented.

OK, so I'll assess each step against these criteria and make a proposal.
I was actually working on the dedup when I had to prioritize the work on Argos, since January was coming to an end :upside_down_face: and I had said I'd do it before then.
Also, to integrate it fully, you'll have to integrate CometKiwi first, which uses a CC BY-NC-SA license, i.e. non-commercial use only and share-alike. I don't mind too much, because eventually I should open-source all my artifacts, but do you?
fasttext is also under a CC share-alike license, but there I only had to credit the creators in filters.md to comply.

Assessment:
Steps 1 to 7 and 9 are already operational in my version of the legacy code. I need a few days to clean-commit them one by one into the project and make a PR.
1 became quite obvious when I ended up with four permanent directories in the cache.
2 edits fewer than five lines of code and is useful to anyone wanting to assess a specialized model.
3 is useful for anybody looking for data in OPUS and not finding it straightforwardly (the comments include the absence of an NLLB dataset for some languages); the review was extensive.
4 covers two filters: I use fast_lang on nearly all datasets across several corpora, and limit_latin_chars is useful for Japanese, Arabic and Korean at least. Both are already functional in filters/data with a couple dozen lines of code.
5, the spaCy download, is already integrated: if my PR goes into Argos-translate quickly, I may consider amending it (no need for a packaged xx_ud_sent_sm anymore).
6 covers issues I encountered in Chinese, where the byte-fallback tokens should be extensively trained but are not, so I guess it is only about config parameters. We have already discussed it: it's about restoring some ONMT default parameters that may have evolved since you wrote the scripts.
7 is open to discussion: when merging big datasets (more than 50M sentences and 10 sources) the memory error occurred systematically, so I began producing smaller subsets, splitting NLLB into 20M-sentence excerpts, UN into its own subset, and intensively filtered corpora into a “filters” subset, then assembled the subsets later (after cutting them with CometKiwi). It may be OS- or hardware-dependent, though; I'll have to check on my new hardware first.
8 I will most likely keep in the config.json files, and I'll produce a TRANSFORMERS.md file to document it: I forgot that the Lynx architecture requires editing 2 files in CTranslate2 (@lynxpda had made PR#1687, but it has remained in PR hell ever since; you see, this PR lets us beat the owner of CTranslate2 at its own game, so they didn't play ball).
9 is useful for anyone assessing a model they have produced against a third language they have to translate into. It's about 20 lines and the implementation is quite straightforward (unzip the pivot Argos package into the utils directory and there you go; there are two extra arguments, pivot_from and pivot_to, and I can probably add a few lines of code to make it a single argument if you require it).
10 is run in a separate script so far, but for thorough deduplication it needs integration into data or a utils module; a sketch of the scoring follows this list. Then there's the license issue: you cannot download the model without creating a Hugging Face account and accepting the license. Integration should be straightforward, but the bigger issue is how long it takes to assess a dataset: although this yields far more accurate models, calculating these scores in one direction (and in the other, they differ) IS time-consuming; we're talking an hour per 2 million sentences (and then another hour backwards) on a high-end consumer GPU. In this regard, the current random method is much faster; only the dedup should come before shuffling. I'll leave it up to you to decide what's best.
11 also runs in a script so far, so the PR will depend on whether integration as a transform is straightforward or not; caching two related corpora together requires a tricky code edit too. It is useful for regional languages (Basque, Catalan, …) and for whoever wants to train language pairs not involving English with relatively small (<15M sentences) data available. This is also time-consuming, about as much as the previous step.
12 will be prioritized between or before 10 and 11; it may take a while if I encounter RAM issues.
13/14: not straightforward at all, and very niche, so I'll keep them in my fork; one can always write CometKiwi scores to a txt file to select alternates.
15: one line to slip into the various fixes (#6) commit, with the default “auto” value.
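
To illustrate item 10, here is a minimal sketch of CometKiwi scoring with the unbabel-comet package; downloading Unbabel/wmt22-cometkiwi-da requires a Hugging Face account that has accepted the model license, and the batch size, GPU setting and example pairs below are illustrative only.

```python
# Sketch of CometKiwi quality estimation; requires the unbabel-comet package
# and a Hugging Face login that has accepted the model license.
import statistics
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

# (source, translation) pairs; the second one is a deliberate mistranslation
pairs = [
    ("Bonjour tout le monde.", "Hello everyone."),
    ("Le chat dort sur le canapé.", "The stock market closed early today."),
]
data = [{"src": s, "mt": t} for s, t in pairs]

# gpus=1 assumes a CUDA device; this is the slow part (roughly an hour
# per 2 million pairs per direction on a high-end consumer GPU)
prediction = model.predict(data, batch_size=32, gpus=1)

# "Scrap the worst half": keep the pairs scoring at or above the median
median = statistics.median(prediction.scores)
kept = [p for p, score in zip(pairs, prediction.scores) if score >= median]
```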

So I pulled 1 to 6 and 9 into Locomotive today, as well as the necessary edits in train to allow 8 (not yet the TRANSFORMERS.md file, but some comments that already help).

As for spaCy, I rewrote the code to include spaCy single-language packages when they can beat Stanza; that requires a new module, since there are 90 lines of new code to this end (a rough sketch of the underlying fallback logic is below).
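
The actual module is about 90 lines and also handles packaging and the case where a language-specific spaCy model beats Stanza; the sketch below only shows the basic fallback idea, and it assumes Stanza raises an exception when asked to download an unsupported language.

```python
# Rough sketch of the sentence-splitter fallback; not the actual module.
import spacy.cli
import stanza

def get_sentence_splitter(lang_code):
    try:
        # Assumption: stanza.download() fails for unsupported languages
        stanza.download(lang_code, processors="tokenize")
        return ("stanza", lang_code)
    except Exception:
        # Fall back to spaCy's public multilingual sentence model
        spacy.cli.download("xx_sent_ud_sm")
        return ("spacy", "xx_sent_ud_sm")
```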

I'll see what I may or may not include in TRANSFORMERS.md, but most of it can be derived from our discussions with @lynxpda last spring.

Now it comes down to the dedup thing and how to curate datasets with CometKiwi (which intertwine quite a lot in my current process), plus the pivot transform.

@pierotofy: tell me what kind of dedup you'd like (keep all alternates, run CometKiwi for x hours then select alternates, or keep only the first one encountered), and I'll optimize it; a quick sketch of the first and last options is just below.
(If it's the first-encountered option, then shuffling will go after dedup: since NLLB is in decreasing order of LASER scores, that should help somewhat.)
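
For reference, here is a stripped-down sketch of what the first and last options amount to (function names are illustrative, not the actual code): "all alternates" drops only exact (source, target) duplicates, while the legacy-style behaviour keys on the source side alone.

```python
def dedup_all_alternates(pairs):
    """Drop exact (source, target) duplicates; keep every distinct alternate."""
    seen = set()
    for src, tgt in pairs:
        if (src, tgt) not in seen:
            seen.add((src, tgt))
            yield src, tgt

def dedup_first_encountered(pairs):
    """Legacy-style: keep one target per source, whichever comes first."""
    seen = set()
    for src, tgt in pairs:
        if src not in seen:
            seen.add(src)
            yield src, tgt
```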

All alternates seems like the way to go. :+1:

OK, I had already coded this in a branch among other things, so it’s ready.
I tested and debugged it on

  • a very small dataset to begin with (hy-fr, 340k sentences; removed 29k in seconds)
  • a bigger one with many duplicates (MultiUN+UNPC in zh-fr: 31M sentence pairs read, 27M written; total run time was 12 minutes and RAM use maxed out at 5GB, which is less than the input files' size, though not as efficient as I thought it would be)
  • an intermediate one (tt-en, 8.7M sentences; removed 28k, total run time a few minutes)

The duplicate counts are consistent with the overlap between the corpora:

  1. since the original encoding in UNPC renders ’ as &quot; or &apos; (and so on), the dedup ends up keeping both the correct sentence from MultiUN and the typo version from UNPC (at least it does not choose between them randomly, as happened when filtering from zh with the legacy dedup);
  2. regarding the Chinese text, the only differences are the presence or absence of “of” (the 的 character), which is more often implicit than not, so it's OK;
  3. and last but not least, whereas the legacy dedup dumped roughly 60% of NLLB tt-en, this one keeps 98%, so it is definitely useful for low-resource languages.

However, I guess there are many alternates of varying quality within the extra data that has now been saved.

Regarding NLLB, this can be addressed quite simply using the top filter, choosing the percentage by going through NLLB with someone who speaks the language. Please tell me if you want to keep it that way, and I'll commit/push. I might be able to tell you next Wednesday whether it trains better tt-en models (this weekend I've got some other training on the bench).

As for other corpora, and since I use a lot of them, I'll get onto feature 10, i.e. integrating CometKiwi scoring and picking alternates on a “best” or “better” basis (1-2 alternates => keep 1, 3-6 => 2, 7-12 => 3, 13+ => 4); see the sketch below.
The latter nut is tougher to crack, but I already nagged Copilot into giving me a starting basis last month. I also have to check the RAM use: it has to work with at least 100M sentence pairs, and I've got a “sort” operation in the code that might spoil the party…
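
Here is a sketch of that “better” selection rule (the helper names are mine, not the final code): the number of alternates kept grows with the number available, and the highest CometKiwi-scored ones win.

```python
def alternates_to_keep(n):
    """Map the number of alternates for one source to how many to keep:
    1-2 => 1, 3-6 => 2, 7-12 => 3, 13+ => 4."""
    if n <= 2:
        return 1
    if n <= 6:
        return 2
    if n <= 12:
        return 3
    return 4

def pick_best_alternates(alternates):
    """`alternates` is a list of (target, cometkiwi_score) tuples for one
    source sentence; return the top-scored subset per the mapping above."""
    k = alternates_to_keep(len(alternates))
    return sorted(alternates, key=lambda a: a[1], reverse=True)[:k]
```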

Please tell me if you're interested in this feature, considering that for the UN corpora it will take at least 7 hours (and then another 7 in the other direction) to process the data instead of 12 minutes. As far as I am concerned, it's worth the while, but for anyone who just wants a decent model for everyday use, it's overkill.

Please also tell me if you are interested in 7 (the --subset option in train; the code is functional, but I have to swap a few lines now that the dedup can go before it) and 11 (the “pivot” transform, which also takes hours); a sketch of the pivot step follows.
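
For 11, the core of the pivot step is just the argostranslate API; the .argosmodel filename and corpus paths below are placeholders, and as said above the output still needs a CometKiwi pass afterwards.

```python
# Sketch of pivoting one side of a corpus with an installed Argos package.
# The package filename and corpus paths are placeholders.
import argostranslate.package
import argostranslate.translate

argostranslate.package.install_from_path("translate-es_en.argosmodel")

# Turn an eu-es corpus into eu-en by translating the Spanish side line by line.
with open("corpus.eu-es.es", encoding="utf-8") as spanish, \
     open("corpus.eu-en.en", "w", encoding="utf-8") as english:
    for line in spanish:
        english.write(argostranslate.translate.translate(line.strip(), "es", "en") + "\n")
```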


This seems like a big increase in runtime. I think it’s important to keep the speed reasonably fast for casual users (so maybe not in scope for inclusion?)

  1. For 7, I can see the use of a standalone script for splitting large datasets, but I don't think this should be tightly integrated into the training script (so also perhaps not in scope)

Then I'll check how well training runs with this new deduplicator (mostly testing how often keeping all alternates incurs vanishing gradients and whether this can be mitigated with the top and excerpt filters; that may take a few weeks, as I'm traveling from the 15th to the 23rd), and I'll push it with the train.md (better than TRANSFORMERS.md, casual users might not see the point).
If you like, I'll also include a data.md with recommendations for dataset assembly. It will mostly contain remarks about using filters and about which corpora are the most useful. I'll nonetheless mention the use of QE evaluation (CometKiwi and BleurtQE), quoting from the Google pre-print, as well as the rationale for splitting (too many threads running in the writer compete for adjacent memory space and end up using very fragmented memory, eventually crashing) and how to split into subsets with the first script I used to regroup a subset from the training data; a sketch of such a split is below.
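
For what it's worth, the subset split itself boils down to something like the sketch below; the 20M figure is the one quoted earlier for NLLB, and the paths and naming scheme are placeholders.

```python
# Split an aligned corpus (two files, one sentence per line) into fixed-size
# subsets; 20M lines per subset is the size quoted above for NLLB.
SUBSET_SIZE = 20_000_000

def split_corpus(src_path, tgt_path, prefix, subset_size=SUBSET_SIZE):
    out_src = out_tgt = None
    chunk = 0
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        for i, (s, t) in enumerate(zip(fs, ft)):
            if i % subset_size == 0:
                if out_src:
                    out_src.close()
                    out_tgt.close()
                chunk += 1
                out_src = open(f"{prefix}.part{chunk}.src", "w", encoding="utf-8")
                out_tgt = open(f"{prefix}.part{chunk}.tgt", "w", encoding="utf-8")
            out_src.write(s)
            out_tgt.write(t)
    if out_src:
        out_src.close()
        out_tgt.close()
```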
Further features, I’ll keep in a distinct branch that I’ll create on Monday.