I’ve started working on NLLU (https://github.com/LibreTranslate/nllu) with the goal of running inference on NLLB at scale (and cheaply) to generate a corpus of backtranslated data for a variety of languages.
As a first run, I’ve started translating 15 million Paracrawl sentences from English → Italian. It should take about a week.
I plan to generate data for Polish and Dutch next.
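For a sense of scale, 15 million sentences in about a week works out to roughly 25 sentences per second sustained, which is why the work gets split across inference workers. A quick back-of-the-envelope sketch (the sharding helper is hypothetical, not part of nllu):

```python
# Rough throughput implied by the first run: 15M sentences in ~1 week.
SENTENCES = 15_000_000
SECONDS_PER_WEEK = 7 * 24 * 3600

rate = SENTENCES / SECONDS_PER_WEEK
print(f"~{rate:.1f} sentences/second sustained")  # -> ~24.8 sentences/second sustained

# Hypothetical helper: split the corpus into contiguous shards so
# each inference worker can take one slice of the sentence range.
def shard_bounds(total: int, workers: int) -> list:
    """Return (start, end) index pairs covering range(total)."""
    base, extra = divmod(total, workers)
    bounds, start = [], 0
    for i in range(workers):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

print(shard_bounds(10, 3))  # -> [(0, 4), (4, 7), (7, 10)]
```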
Awesome project!
When the Italian data is done, I can train an OpenNMT model on the generated data to see if it improves on our current model.
10 million sentences translated. Just waiting on the last 5…
CCMatrix trained a neural network to identify semantic similarity between sentences. Then they found parallel sentences that already existed on the web to use as translation data.
> To address the significant computational challenges posed by comparing billions of sentences to determine which ones are mutual translations…
I think this project is a sort of improvement on the CCMatrix approach, enabled by better large language models: instead of using a neural network to identify existing sentence pairs in web-crawl data, you’re using a neural network to generate the translations directly.
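To illustrate the CCMatrix side of the comparison: mining scores candidate sentence pairs by embedding similarity and keeps the best matches. A toy sketch with made-up 3-dimensional "embeddings" (real mining uses multilingual encoders such as LASER, in hundreds of dimensions, over billions of sentences):

```python
from math import sqrt

# Toy "sentence embeddings": one short vector per sentence.
english = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2]]
italian = [[0.8, 0.2, 0.1],   # close to english[0]
           [0.1, 0.9, 0.3]]   # close to english[1]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# For each English sentence, pick the Italian candidate with the
# highest similarity score — the pairs CCMatrix-style mining would keep.
matches = [max(range(len(italian)), key=lambda j: cosine(e, italian[j]))
           for e in english]
print(matches)  # -> [0, 1]
```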
Here’s the first dataset (Italian): https://nllu.libretranslate.com
Dutch dataset in progress
What’s interesting about these datasets is that they all share the same source text, so they can be mapped directly onto one another; e.g. it’s possible to distill an Italian <=> Dutch dataset.
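Because both datasets share the same English source sentences, distilling a pair like Italian <=> Dutch is just a join on the source side. A minimal sketch (the record layout and sample sentences are made up; the real dataset format may differ):

```python
# Hypothetical bitext records: (english_source, translation).
en_it = [("hello", "ciao"), ("thank you", "grazie"), ("goodbye", "ciao")]
en_nl = [("hello", "hallo"), ("thank you", "dank je")]

def distill(pairs_a, pairs_b):
    """Join two bitexts on their shared source text, yielding
    target-A <=> target-B pairs (here Italian <=> Dutch).
    Sources present in only one bitext are dropped."""
    by_source = dict(pairs_b)  # source sentence -> target-B translation
    return [(ta, by_source[src]) for src, ta in pairs_a if src in by_source]

print(distill(en_it, en_nl))  # -> [('ciao', 'hallo'), ('grazie', 'dank je')]
```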
Note: I’ve found a fault in nllu-server that was causing some bitext pairs to be misaligned. I’ve fixed the issue for both Italian and Dutch and I’m re-uploading the corrected datasets, so I recommend re-downloading them for better training results.