Hello guys.
Over the last year I developed an array of things around Locomotive, but since last summer I have been feeling quite the heat, so I did not take the time to publish them properly.
There’s:
- Introduced a “utils” directory (handier than cache when you run lots of utilities).
- Allowing translation memories as a dummy flores200 dataset: convert TMX to Moses format with a separate script if necessary (see sketch 1 after the list), then name the files as if they were a flores200 dataset and put them where they belong (utils/flores200…).
- Updating the NLLB languages in data: I appended the language names used in OPUS as comments and went a little further (with the help of our Persian teacher I could tell Persian pes and prs apart, and I gave Tatar the tt code, since the code crt in the current list stands for Crimean Tatar), but beyond that I'll leave it to the community.
- Two useful filters: one that uses fastText (another subdir in utils) for third-language filtering, which I mostly use on EU and TED/QED/Wiki corpora, and one that limits parasite Latin characters (proving handy with the HPLT corpus in Arabic and the Japanese NLLB corpus); see sketch 2 after the list. I had to tweak the process_data method quite a bit to turn the languages specified in config.json into kwargs and feed them to the lambda.
- Downloading the spaCy xx model or Stanza depending on Stanza's response (I could maybe code language-specific spaCy models first, but haven't so far…) and packaging whatever comes out; see sketch 3 after the list.
- Some fixes to byte fallback and OpenNMT configuration parameters.
- A “prepare data only” option in train.py: it stops the script right before shuffling, which avoids random memory errors in the multithreaded writer when using more than 10 sources. You can then process the few resulting subsets in a second step (see sketch 4 after the list).
- Making the Lynx configuration the default, since it works with any GPU generation (161M parameters; a full training run takes up to 3 days on a Tesla V100S, about two on an RTX 4000, and 20 to 40 hours on an RTX 6000 Ada). It improves performance by an order of magnitude compared to the vanilla configuration.
- Including pivoting through another language in the eval (put the corresponding unzipped pivot Argos packages in utils).
- Using CometKiwi to assess data quality and scrap the worst half (see sketch 5 after the list). Dropping half is what I found worked best, and Google published the same conclusion while I was researching the right amount. Well, for some languages it isn't the right cut-off, but you can figure that out if you know the languages: at the median there should be a majority of sentences with minor errors and less than 30% with critical ones.
- Using an Argos model to pivot a whole corpus; see sketch 6 after the list. It can come in handy for Basque, to pivot eu-es into eu-en or vice versa; I generally use it to pivot xx-en data into xx-fr. Crucially, as long as you assess and cut with CometKiwi afterwards (right now there is a 20-30% quality loss at translation, but following steps 10 to 12 this should get better), and provided at least 20% of the corpora remain original and untransformed, it yields better results than the English pivot. For now it is a script; I still have to recode it as a transform.
- Recoding the dedup: first because it introduces an imbalance between datasets when using the reverse option for training, and second because it is totally random, so you can have bad luck and prepare a model that is up to 25% less accurate only because the shuffling put the wrong alternates first and dedup picked them for training. After two weeks of dataset research and a fumble (which mistakenly erased a training dataset), I finally got an acceptable ja-fr model that was 30% more accurate than the previous one obtained with the same configuration. So I am coding something that scraps true duplicates, computes CometKiwi scores, and keeps the best/some/any/random alternatives as a result (see sketch 7 after the list). I am trying to optimize resources, but running CometKiwi on a Tesla V100S (5k CUDA cores) non-stop in the cloud for 3 months to assess the UN working languages corpora made buying an RTX 6000 (18k CUDA cores) a very rational decision.
- Adding traceability: keep filtered or deduped data on the side and write metadata JSON Lines files with the relevant data (corpus, line number within the corpus, filters/transforms/augments applied, CometKiwi scores to and fro); see sketch 8 after the list. That would allow further data science on the datasets.
- Automating README citations by parsing this metadata.
- And allowing the seed to be fixed in OpenNMT for more accurate research on hyperparameters (see sketch 9 after the list).
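
Below are quick sketches for the points referenced above. They illustrate the intent rather than the exact code in my branches, and any file names, paths, thresholds, and function names are placeholders.

Sketch 1: a minimal TMX to Moses converter using only the standard library. It only reads plain `<seg>` text, so inline markup would need more handling. Rename the two output files with flores200-style names (e.g. `eus_Latn.devtest` / `fra_Latn.devtest`) and drop them where the eval expects the flores200 data.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_to_moses(tmx_path, src_code, tgt_code, src_out, tgt_out):
    """Write one line-aligned plain-text file per language from a TMX translation memory."""
    tree = ET.parse(tmx_path)
    with open(src_out, "w", encoding="utf-8") as fs, open(tgt_out, "w", encoding="utf-8") as ft:
        for tu in tree.getroot().iter("tu"):
            segs = {}
            for tuv in tu.iter("tuv"):
                lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
                seg = tuv.find("seg")
                if seg is not None and seg.text:
                    segs[lang.split("-")[0]] = " ".join(seg.text.split())
            if src_code in segs and tgt_code in segs:
                fs.write(segs[src_code] + "\n")
                ft.write(segs[tgt_code] + "\n")

# Example: reuse an eu-fr translation memory as a dummy flores200 devtest split
tmx_to_moses("memory.tmx", "eu", "fr", "eus_Latn.devtest", "fra_Latn.devtest")
```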
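Sketch 2: the two filters, assuming fastText's publicly available lid.176 language-ID model downloaded somewhere under utils. The path, thresholds, and function names are illustrative, not the actual Locomotive filter names.

```python
import re
import fasttext

lid = fasttext.load_model("utils/fasttext/lid.176.ftz")  # path is an assumption

def third_language_filter(src, tgt, src_lang, tgt_lang, min_conf=0.5):
    """Keep a pair only if neither side is confidently detected as a third language."""
    for text, expected in ((src, src_lang), (tgt, tgt_lang)):
        labels, probs = lid.predict(text.replace("\n", " "))
        detected = labels[0].replace("__label__", "")
        if probs[0] >= min_conf and detected != expected:
            return False
    return True

LATIN = re.compile(r"[A-Za-z]")

def latin_ratio_filter(text, max_ratio=0.2):
    """Keep a line only if stray Latin characters stay under max_ratio of its
    non-space characters (useful for Arabic or Japanese corpora)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    return len(LATIN.findall(text)) / len(chars) <= max_ratio
```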
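Sketch 3: the spaCy/Stanza fallback. I am hedging on the exact exception Stanza raises for an unsupported language, so this catches broadly and falls back to the multilingual xx sentence model.

```python
import stanza
import spacy
import spacy.cli

def get_sentence_splitter(lang_code):
    """Try Stanza first; if it has nothing for this language, fall back to spaCy's xx model."""
    try:
        stanza.download(lang_code, processors="tokenize")
        return ("stanza", stanza.Pipeline(lang_code, processors="tokenize"))
    except Exception:
        # No Stanza support (or the download failed): package the multilingual spaCy model instead
        try:
            nlp = spacy.load("xx_sent_ud_sm")
        except OSError:
            spacy.cli.download("xx_sent_ud_sm")
            nlp = spacy.load("xx_sent_ud_sm")
        return ("spacy", nlp)
```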
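Sketch 4: the "prepare data only" switch is just an early exit before the shuffle/write step; the argument name here is illustrative.

```python
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument(
    "--prepare_data_only",
    action="store_true",
    help="Stop right before shuffling so the prepared subsets can be merged in a second run",
)
args = parser.parse_args()

# ... corpus download, filtering and transforms would run here ...

if args.prepare_data_only:
    print("Data prepared; skipping the multithreaded shuffle/write step.")
    sys.exit(0)
```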
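Sketch 5: the CometKiwi cut, assuming the unbabel-comet package and the gated Unbabel/wmt22-cometkiwi-da checkpoint (you have to accept its license on Hugging Face first). Batch size and the median cut-off are the knobs to adjust per language.

```python
from comet import download_model, load_from_checkpoint

def keep_best_half(src_lines, tgt_lines, gpus=1, batch_size=64):
    """Score pairs with CometKiwi (reference-free) and scrap the worst-scoring half."""
    model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))
    data = [{"src": s, "mt": t} for s, t in zip(src_lines, tgt_lines)]
    scores = model.predict(data, batch_size=batch_size, gpus=gpus).scores
    cutoff = sorted(scores)[len(scores) // 2]  # median score
    return [(s, t, sc) for s, t, sc in zip(src_lines, tgt_lines, scores) if sc >= cutoff]
```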
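Sketch 6: pivoting a corpus file with Argos Translate's high-level API. The package path and language codes are examples; the .argosmodel only needs to be installed once.

```python
import argostranslate.package
import argostranslate.translate

def pivot_corpus(in_path, out_path, from_code, to_code, package_path=None):
    """Translate a whole monolingual file with an installed Argos package."""
    if package_path:
        argostranslate.package.install_from_path(package_path)  # a .argosmodel file
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(argostranslate.translate.translate(line.strip(), from_code, to_code) + "\n")

# e.g. turn the English side of an eu-en corpus into French to build eu-fr data
pivot_corpus("corpus.eu-en.en", "corpus.eu-fr.fr", "en", "fr", "utils/translate-en_fr.argosmodel")
```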
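Sketch 7: the direction I am taking for the dedup rework, with precomputed CometKiwi scores. The "best" policy is shown; the some/any/random variants would sample from the alternatives instead of taking the maximum.

```python
from collections import defaultdict

def dedup_best(pairs, scores):
    """pairs: list of (src, tgt); scores: one CometKiwi score per pair.
    Drops exact duplicate pairs, then keeps the best-scoring alternative per source."""
    seen_exact = set()
    by_src = defaultdict(list)
    for (src, tgt), score in zip(pairs, scores):
        if (src, tgt) in seen_exact:   # true duplicate: scrap it
            continue
        seen_exact.add((src, tgt))
        by_src[src].append((score, tgt))
    return [(src, max(alts)[1]) for src, alts in by_src.items()]
```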
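Sketch 8: the traceability records as JSON Lines, one object per sentence pair. The field names and output path are a proposal, not a fixed schema.

```python
import json
import os

def append_metadata(path, records):
    """Append one JSON object per line so the file can be streamed later."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

append_metadata("utils/metadata/ja-fr.jsonl", [{
    "corpus": "NLLB",
    "line": 123456,                                   # line number within the original corpus
    "filters": ["third_language", "latin_ratio"],
    "transforms": [],
    "augments": [],
    "cometkiwi": {"src-tgt": 0.71, "tgt-src": 0.68},  # scores to and fro
    "kept": True,
}])
```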
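Sketch 9: fixing the seed is mostly a matter of passing a value through to OpenNMT's seed option instead of leaving it unset, plus seeding whatever runs outside of onmt. The helper below only illustrates that idea.

```python
import random
import numpy as np
import torch

def fix_seed(onmt_config: dict, seed: int = 1234) -> dict:
    """Pin the RNGs for reproducible hyperparameter comparisons."""
    onmt_config["seed"] = seed   # picked up by OpenNMT-py training options
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    return onmt_config
```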
Please tell me what would work for you (and I’ll PR it) and what won’t (I’ll put it on my own branch).