CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web

argosopentech · February 26, 2023, 5:48pm

https://opus.nlpl.eu/CCMatrix.php

CCMatrix is the largest dataset of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public dataset, CCMatrix is more than 50 times larger than the WikiMatrix corpus that we shared last year. Gathering a dataset of this size required modifying our previous bitext mining approach used for WikiMatrix, assuming that the translation of one sentence could be found anywhere on CommonCrawl, which functions as an open archive of the internet. To address the significant computational challenges posed by comparing billions of sentences to determine which ones are mutual translations, we used massively parallel processing, as well as our highly efficient FAISS library for fast similarity searches.
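To make the mining step above more concrete, here is a minimal sketch of margin-based bitext mining on a toy scale. It assumes sentence embeddings are already computed (for example with a multilingual encoder) and uses brute-force NumPy in place of FAISS. The function names, the `k` and `threshold` parameters, and the "mutual best match" rule are illustrative simplifications, not the authors' actual pipeline.

```python
import numpy as np

def margin_scores(src, tgt, k=2):
    """Ratio-margin score for every (source, target) embedding pair.

    src: (n_src, d) array, tgt: (n_tgt, d) array of sentence embeddings.
    The score divides the cosine similarity of a pair by the average
    similarity of each sentence to its k nearest neighbours.
    """
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T  # cosine similarities, shape (n_src, n_tgt)

    # Mean similarity to the k nearest neighbours in each direction.
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # (n_src,)
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # (n_tgt,)

    denom = (src_knn[:, None] + tgt_knn[None, :]) / 2
    return sim / denom

def mine_pairs(src, tgt, k=2, threshold=1.0):
    """Keep pairs that are each other's best match and score >= threshold.

    This mutual-best-match rule is a simplification chosen for the sketch.
    """
    scores = margin_scores(src, tgt, k)
    best_tgt = scores.argmax(axis=1)  # best target for each source
    best_src = scores.argmax(axis=0)  # best source for each target
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and scores[i, j] >= threshold:
            pairs.append((i, int(j), float(scores[i, j])))
    return pairs
```

At web scale the full similarity matrix cannot be materialized, which is where an approximate nearest-neighbour index such as FAISS comes in.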

Bilingual models [2]: To evaluate the quality of the mined bitexts, we train NMT systems for most of the language pairs and evaluate them on TED, WMT and WAT test sets. Using our mined bitexts only and no human translated parallel data, we achieve a new state-of-the-art for a single system on the WMT’19 test set for translation between English and German, Russian and Chinese, as well as German/French. In particular, our English/German system outperforms the best single one by close to 4 BLEU points and is almost on par with the best WMT’19 evaluation system, which uses system combination and back-translation. We also achieve excellent results for distant language pairs like Russian/Japanese, outperforming the best submission at the 2019 Workshop on Asian Translation (WAT).

Multilingual models [3]: CCMatrix data is used to train M2M-100, a large-scale Many-to-Many multilingual translation model. The thousands of directions we mine produce training data for direct translations without relying solely on English data. We mine using a novel strategy which exploits language groupings and bridge languages to avoid mining every possible direction while maintaining good accuracy. By training on this data and scaling model capacity through model parallelism and language-specific parameters, M2M-100 outperforms English-Centric multilingual models trained on data where either the source or target language is English. The system improves over 10 BLEU on average compared to an English-Centric baseline when translating directly between non-English directions. M2M-100 is competitive with bilingual models from WMT and improves over existing publicly available multilingual translation systems.

pierotofy · February 26, 2023, 6:47pm

For those wondering, to run the script you need to:

pip install git+https://github.com/kpu/kenlm.git
pip install --no-deps cc_net
pip install func_argparse
python dl_cc_matrix.py

The instructions on the repo are wrong.

pierotofy · February 26, 2023, 6:53pm

Also keep getting 403 errors:

requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550249414450.79/wet/CC-MAIN-20190223001001-20190223023001-00275.warc.wet.gz
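
One possible cause, stated as an assumption and not verified against this script: Common Crawl no longer serves anonymous requests on the old `commoncrawl.s3.amazonaws.com` hostname and offers HTTPS downloads from `data.commoncrawl.org` instead. If that is what is happening here, rewriting the host before the request may be enough. The helper below is a hypothetical sketch, not part of `dl_cc_matrix.py` or `cc_net`.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed hostnames: the old S3 endpoint and the current download host.
OLD_HOST = "commoncrawl.s3.amazonaws.com"
NEW_HOST = "data.commoncrawl.org"

def rewrite_cc_url(url):
    """Point an old-style Common Crawl URL at the newer download host.

    URLs on any other host are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc != OLD_HOST:
        return url
    return urlunsplit(parts._replace(netloc=NEW_HOST))
```

The path after the hostname is left untouched, so only the server changes.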

argosopentech · February 26, 2023, 6:54pm

I’ve never used the script to reproduce it. I just downloaded the compiled dataset from Opus.

pierotofy · February 26, 2023, 6:55pm

Ah, nice! Didn’t know opus already extracted it.

pierotofy · February 26, 2023, 7:04pm

It’s interesting because some translations are completely off:

https://opus.nlpl.eu/CCMatrix/v1/en-it_sample.html

E.g.

They followed the example of their Lord:
E, di più, perseguitano il loro Signore.

Is completely wrong lol, it should be:

Seguirono l'esempio del loro Signore

Other samples are good, while others are so-so.

I wonder if, for overall accuracy, a large number of “so-so” samples beats a small or medium number of high-quality samples.

argosopentech · February 26, 2023, 7:36pm

I think the tradeoff with CCMatrix is that it’s a huge dataset but it’s of mediocre quality. The XLEnt dataset is generated by finding higher-quality translations in CCMatrix if you want a smaller but higher-quality dataset.

Since CCMatrix is so large, it can be cumbersome to work with. By default the Argos Train scripts exclude larger datasets for performance, which means CCMatrix isn’t used.

I’m guessing more data, even if it’s of low quality, generally improves translations. You could also try training on the low-quality data, then fine-tuning on higher-quality data.
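
As a rough sketch of that two-stage idea, the data could be split by a per-pair alignment score, with the large low-scoring part used for initial training and the small high-scoring part for fine-tuning. This assumes each sentence pair comes with a numeric score. The function name and the threshold are hypothetical, and nothing here is part of Argos Train.

```python
def split_by_score(pairs, threshold):
    """Split (source, target, score) triples into two lists of pairs.

    Pairs scoring at or above the threshold go to the first list (a
    candidate fine-tuning set), the rest to the second list (a candidate
    pre-training set). The threshold is something you would have to
    tune by inspecting samples.
    """
    high, low = [], []
    for source, target, score in pairs:
        if score >= threshold:
            high.append((source, target))
        else:
            low.append((source, target))
    return high, low
```

Whether the fine-tuning stage actually helps would need to be checked with a held-out test set for the language pair in question.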