Announcing NLLU - No Language Left Unlocked

I’ve started working on NLLU (https://github.com/LibreTranslate/nllu) with the goal of running inference on NLLB at scale (and cheaply) to generate a corpus of backtranslated data for a variety of languages.

I’ve started running inference on 15 million Paracrawl sentences from English → Italian as a first run. It should take about a week. :boom:

I plan to generate data for Polish and Dutch next.


Awesome project!

When the Italian data is done I can train an OpenNMT model using the generated data to see if it’s an improvement over our current model.


10 million sentences translated. Just waiting on the last 5… :clap:


CCMatrix trained a neural network to identify semantic similarity between sentences. Then they found parallel sentences that already existed on the web to use as translation data.

To address the significant computational challenges posed by comparing billions of sentences to determine which ones are mutual translations

I think this project is a sort of improvement on the CCMatrix approach, enabled by better large language models. Instead of using a neural network to identify existing sentence pairs in web crawl data, you’re using a neural network to generate the translation.


Here’s the first dataset (Italian): https://nllu.libretranslate.com

:tada:


Dutch dataset in progress :call_me_hand:


Done! https://nllu.libretranslate.com/

Link to Dutch dataset: https://nllu.libretranslate.com/paracrawl-en-15M/nl.zip

What’s interesting about these datasets is that, since they all use the same source text, they can be mapped directly to one another, e.g. it’s possible to distill an Italian <=> Dutch dataset.
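To illustrate the mapping idea, here is a minimal sketch that joins two bitexts on their shared English sentence. It assumes each dataset can be loaded as a list of (English, translation) pairs; the function name `pivot_pairs` and the sample sentences are made up for the example and say nothing about the actual layout of the files in the zip archives.

```python
def pivot_pairs(en_it, en_nl):
    """Join two English-keyed bitexts on their shared English source sentence.

    en_it and en_nl are lists of (english, translation) tuples. This layout
    is an assumption for illustration, not the actual NLLU file format.
    """
    # Index the Dutch side by its English source for constant-time lookup.
    nl_by_en = {en: nl for en, nl in en_nl}
    # Keep only the English sentences that appear in both datasets.
    return [(it, nl_by_en[en]) for en, it in en_it if en in nl_by_en]


# Made-up sample data, deliberately in a different order on each side.
en_it = [
    ("Hello world.", "Ciao mondo."),
    ("Good morning.", "Buongiorno."),
    ("Thank you.", "Grazie."),
]
en_nl = [
    ("Good morning.", "Goedemorgen."),
    ("Hello world.", "Hallo wereld."),
]

it_nl = pivot_pairs(en_it, en_nl)
for it, nl in it_nl:
    print(f"{it}\t{nl}")
```

Joining on the source sentence rather than on line position means the result does not depend on both files being in the same order, at the cost of holding one side in memory.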

Note: I found an alignment issue caused by a fault in nllu-server, which left some bitext pairs misaligned. I’ve fixed it for both Italian and Dutch and am re-uploading the corrected datasets, so I recommend re-downloading them for better training results.
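If you want a quick sanity check on a downloaded copy, something like the following can flag pairs that look misaligned. This is only a rough heuristic sketch, not how the fault in nllu-server was found or fixed: it assumes the source and target sides are available as two parallel lists of lines, and the function name `find_suspect_pairs` and the `max_ratio` threshold of 3.0 are arbitrary choices for illustration.

```python
def find_suspect_pairs(src_lines, tgt_lines, max_ratio=3.0):
    """Return indices of sentence pairs whose lengths differ suspiciously.

    Heuristic only: a translation is usually of roughly similar length to
    its source, so a large character-length ratio hints at misalignment.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target have different line counts")

    suspects = []
    for i, (src, tgt) in enumerate(zip(src_lines, tgt_lines)):
        src, tgt = src.strip(), tgt.strip()
        # An empty side can never be a valid pair.
        if not src or not tgt:
            suspects.append(i)
            continue
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            suspects.append(i)
    return suspects


# Made-up example where the last two target lines have been swapped.
src = [
    "Hello.",
    "This is a much longer sentence about the weather today.",
    "Thanks.",
]
tgt = [
    "Ciao.",
    "Grazie.",
    "Questa è una frase molto più lunga sul tempo di oggi.",
]

suspects = find_suspect_pairs(src, tgt)
print(suspects)
```

A length ratio will miss misalignments between sentences of similar length, so a clean result here does not prove the file is correctly aligned.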
