I’ve started working on NLLU (https://github.com/LibreTranslate/nllu) with the goal of running inference on NLLB at scale (and cheaply) to generate a corpus of backtranslated data for a variety of languages.
I’ve started running inference on 15 million Paracrawl sentences from English → Italian as a first run. It should take about a week.
I plan to generate data for Polish and Dutch next.
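For a rough sense of scale, the "about a week" estimate is easy to sanity-check with a back-of-envelope calculation. The throughput figure below is an assumption for illustration, not a number from nllu; actual NLLB throughput depends on GPU, batch size, and model size:

```python
# Back-of-envelope: how long do 15M sentences take at a given throughput?
# ~25 sentences/sec is an assumed aggregate rate across workers, not measured.
SENTENCES = 15_000_000
SENTS_PER_SEC = 25

seconds = SENTENCES / SENTS_PER_SEC
days = seconds / 86_400
print(f"{days:.1f} days")  # ~6.9 days, consistent with "about a week"
```

Doubling the workers (or the per-GPU batch size) roughly halves the wall-clock time, which is the main lever for running this cheaply on spot instances.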
CCMatrix trained a neural network to identify semantic similarity between sentences. Then they found parallel sentences that already existed on the web to use as translation data.
To address the significant computational challenge of comparing billions of sentences to decide which ones are mutual translations, they relied on fast approximate nearest-neighbor search over multilingual sentence embeddings.
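The mining step can be sketched with toy vectors; the real pipeline (CCMatrix used LASER embeddings and approximate nearest-neighbor search) does the same thing at vastly larger scale. The sentences and embedding values below are invented purely for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy sentence embeddings (made up; a real multilingual encoder such as
# LASER places translations close together in the same vector space).
english = {
    "The cat sleeps.": [0.9, 0.1, 0.0],
    "Prices rose sharply.": [0.0, 0.2, 0.9],
}
italian = {
    "Il gatto dorme.": [0.88, 0.12, 0.05],
    "I prezzi sono saliti.": [0.05, 0.25, 0.85],
}

# For each English sentence, keep the closest Italian sentence if its
# similarity clears a threshold -- a (very) simplified mining criterion.
pairs = []
for en, ev in english.items():
    best_it, best_vec = max(italian.items(), key=lambda kv: cosine(ev, kv[1]))
    if cosine(ev, best_vec) > 0.8:
        pairs.append((en, best_it))

print(pairs)
```

At CCMatrix scale, the pairwise loop is replaced by an approximate index (e.g. FAISS), since comparing billions of sentences exhaustively is infeasible.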
I think this project is a sort of improvement on the CCMatrix approach, enabled by better large language models: instead of using a neural network to identify existing sentence pairs in web-crawl data, you use a neural network to generate the translation.
What’s interesting about these datasets is that, since they all use the same source text, they can be mapped directly to one another; e.g. it’s possible to distill an Italian <=> Dutch dataset.
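Because the English side is shared, distilling e.g. Italian <=> Dutch is just a join on the source sentence. A minimal sketch, where the tuples stand in for the real dataset files (whose exact layout I'm assuming):

```python
# Each backtranslation dataset pairs the same English source with one target
# language, so joining on the English side yields a target<->target corpus.
en_it = [
    ("Hello world", "Ciao mondo"),
    ("Good morning", "Buongiorno"),
]
en_nl = [
    ("Hello world", "Hallo wereld"),
    ("Thank you", "Dank je"),
]

nl_by_en = dict(en_nl)  # index the Dutch side by its English source
it_nl = [(it, nl_by_en[en]) for en, it in en_it if en in nl_by_en]
print(it_nl)  # [('Ciao mondo', 'Hallo wereld')]
```

Only sentences present in both datasets survive the join, so the distilled corpus is at most as large as the smaller of the two.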
Note: I’ve found an alignment issue caused by a fault in nllu-server, which was misaligning some bitext pairs. I’ve fixed the issue for both Italian and Dutch and am re-uploading the new datasets, so I recommend re-downloading them for better training results.
Hi all! I’m interested in creating a Basque dataset, due to the lack of open Basque datasets.
But I wonder how much it will cost to translate those 15 million Paracrawl sentences… how many machines did you need/rent to translate the Italian/Dutch texts? How much time did it take? And can I ask roughly how much it cost in dollars?
For Basque, you can train from Spanish without any backtranslation: there are more than 6M sentences in CCMatrix, and you need about 2M to train a passable model.
Do not use only CCMatrix; find some higher-quality corpora among the list that OPUS offers.
As for using backtranslation: only do this when you do not have enough data to train the pair directly. Even pivoting through Spanish will give better translations than training directly on poor backtranslations mixed into the training set.
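Pivoting just chains two existing models, e.g. Basque → Spanish → target. A sketch with stand-in dictionary "models" (invented for illustration; the real ones would be calls to actual eu→es and es→en translation models):

```python
# Stand-in "models": tiny lookup tables instead of real translation models.
eu_to_es = {"Kaixo mundua": "Hola mundo"}
es_to_en = {"Hola mundo": "Hello world"}

def pivot(text, first, second):
    # Translate source -> pivot (Spanish), then pivot -> target.
    return second[first[text]]

print(pivot("Kaixo mundua", eu_to_es, es_to_en))  # Hello world
```

The trade-off is that errors compound across the two hops, but with a strong eu↔es pair that can still beat training on noisy backtranslated data.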
I have tried augmenting the German dataset with backtranslations, to no avail, so I cannot see what it’ll bring to Italian, which already has plenty of data to train from, but suit yourselves…