I’ve started working on NLLU (https://github.com/LibreTranslate/nllu) with the goal of running inference on NLLB at scale (and cheaply) to generate a corpus of backtranslated data for a variety of languages.
I’ve started running inference on 15 million Paracrawl sentences from English → Italian as a first run. It should take about a week.
I plan to generate data for Polish and Dutch next.
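For a rough sense of scale, the "about a week" estimate is easy to sanity-check with a back-of-envelope calculation. The throughput figure below is an assumption for illustration, not a number from nllu; actual NLLB throughput depends on GPU, batch size, and model size:

```python
# Back-of-envelope: how long do 15M sentences take at a given throughput?
# ~25 sentences/sec is an assumed aggregate rate across workers, not measured.
SENTENCES = 15_000_000
SENTS_PER_SEC = 25

seconds = SENTENCES / SENTS_PER_SEC
days = seconds / 86_400
print(f"{days:.1f} days")  # ~6.9 days, consistent with "about a week"
```

Doubling the workers (or the per-GPU batch size) roughly halves the wall-clock time, which is the main lever for running this cheaply on spot instances.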
CCMatrix trained a neural network to identify semantic similarity between sentences. Then they found parallel sentences that already existed on the web to use as translation data.
To address the significant computational challenge of comparing billions of sentences to decide which ones are mutual translations, they relied on fast approximate nearest-neighbor search over multilingual sentence embeddings.
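The mining step can be sketched with toy vectors; the real pipeline (CCMatrix used LASER embeddings and approximate nearest-neighbor search) does the same thing at vastly larger scale. The sentences and embedding values below are invented purely for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy sentence embeddings (made up; a real multilingual encoder such as
# LASER places translations close together in the same vector space).
english = {
    "The cat sleeps.": [0.9, 0.1, 0.0],
    "Prices rose sharply.": [0.0, 0.2, 0.9],
}
italian = {
    "Il gatto dorme.": [0.88, 0.12, 0.05],
    "I prezzi sono saliti.": [0.05, 0.25, 0.85],
}

# For each English sentence, keep the closest Italian sentence if its
# similarity clears a threshold -- a (very) simplified mining criterion.
pairs = []
for en, ev in english.items():
    best_it, best_vec = max(italian.items(), key=lambda kv: cosine(ev, kv[1]))
    if cosine(ev, best_vec) > 0.8:
        pairs.append((en, best_it))

print(pairs)
```

At CCMatrix scale, the pairwise loop is replaced by an approximate index (e.g. FAISS), since comparing billions of sentences exhaustively is infeasible.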
I think this project is a sort of improvement on the CCMatrix approach, enabled by better large language models: instead of using a neural network to identify existing sentence pairs in web-crawl data, you use a neural network to generate the translation.
What’s interesting about these datasets is that, since they all use the same source text, they can be mapped directly to one another; e.g. it’s possible to distill an Italian <=> Dutch dataset.
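Because the English side is shared, distilling e.g. Italian <=> Dutch is just a join on the source sentence. A minimal sketch, where the tuples stand in for the real dataset files (whose exact layout I'm assuming):

```python
# Each backtranslation dataset pairs the same English source with one target
# language, so joining on the English side yields a target<->target corpus.
en_it = [
    ("Hello world", "Ciao mondo"),
    ("Good morning", "Buongiorno"),
]
en_nl = [
    ("Hello world", "Hallo wereld"),
    ("Thank you", "Dank je"),
]

nl_by_en = dict(en_nl)  # index the Dutch side by its English source
it_nl = [(it, nl_by_en[en]) for en, it in en_it if en in nl_by_en]
print(it_nl)  # [('Ciao mondo', 'Hallo wereld')]
```

Only sentences present in both datasets survive the join, so the distilled corpus is at most as large as the smaller of the two.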
Note: I’ve found an alignment issue caused by a fault in nllu-server, which was misaligning some bitext pairs. I’ve fixed the issue for both Italian and Dutch and am re-uploading the new datasets, so I recommend re-downloading them for better training results.
Hi all! I’m interested in creating a Basque dataset, due to the lack of open Basque datasets.
But I wonder how much it will cost to translate those 15 million Paracrawl sentences… how many machines did you need/rent to translate the Italian/Dutch texts? How much time did it take? And can I ask roughly how much it cost in dollars?
For Basque, you can train from Spanish without any backtranslation: there are more than 6M sentences in CCMatrix, and you need about 2M to train a passable model.
Do not use only CCMatrix; find some higher-quality corpora among the list that OPUS offers.
As for using backtranslation: only do this when you do not have enough data to train the pair directly. Even pivoting through Spanish will give better translations than training directly on poor backtranslations mixed into the training set.
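Pivoting just chains two existing models, e.g. Basque → Spanish → target. A sketch with stand-in dictionary "models" (invented for illustration; the real ones would be calls to actual eu→es and es→en translation models):

```python
# Stand-in "models": tiny lookup tables instead of real translation models.
eu_to_es = {"Kaixo mundua": "Hola mundo"}
es_to_en = {"Hola mundo": "Hello world"}

def pivot(text, first, second):
    # Translate source -> pivot (Spanish), then pivot -> target.
    return second[first[text]]

print(pivot("Kaixo mundua", eu_to_es, es_to_en))  # Hello world
```

The trade-off is that errors compound across the two hops, but with a strong eu↔es pair that can still beat training on noisy backtranslated data.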
I have tried augmenting the German dataset with backtranslations, to no avail, so I cannot see what it’ll bring to Italian, which already has plenty of data to train from, but suit yourselves…