Thammegowda/mtdata: A tool that locates, downloads, and extracts machine translation corpora

argosopentech · August 20, 2025, 10:01am

MTData

MTData automates the collection and preparation of machine translation (MT) datasets. It provides a CLI and Python APIs, which can be used for preparing MT experiments.

This tool knows:

From where to download data sets: WMT News Translation tests and devs for Paracrawl, Europarl, News Commentary, WikiTitles, Tilde Model corpus, OPUS …
How to extract files: .tar, .tar.gz, .tgz, .zip, …
How to parse .tmx, .sgm, and similar XML formats, as well as .tsv …; it also checks that both sides have the same number of segments.
Whether parallel data is in one .tsv file or two .sgm files.
Whether data is compressed with gz or xz, or not compressed at all.
Whether the data is in source-target order or swapped into target-source order.
How to map language codes to ISO language codes, using ISO 639-3, which has room for the 7000+ languages of our planet.
- New in v0.3: BCP-47 like language ID: (language, script, region)
Download only once and keep the files in local cache.
(And more such small details, added over time.)
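
To make two of the points above concrete (handling gz/xz compression and checking that both sides have the same number of segments), here is a minimal sketch using only the Python standard library. It is not MTData's own code: the names `open_text`, `count_segments`, and `check_parallel` are hypothetical, and it assumes one segment per line with the compression format indicated by the file suffix.

```python
import gzip
import lzma
from pathlib import Path

# Hypothetical helpers for illustration; not part of MTData's API.
# Maps a file suffix to the function that opens that format.
_OPENERS = {".gz": gzip.open, ".xz": lzma.open}


def open_text(path):
    """Open a plain, gzip, or xz file as UTF-8 text, chosen by suffix."""
    path = Path(path)
    opener = _OPENERS.get(path.suffix, open)  # fall back to plain open()
    return opener(path, mode="rt", encoding="utf-8")


def count_segments(path):
    """Count lines (assumed: one segment per line) in a file."""
    with open_text(path) as lines:
        return sum(1 for _ in lines)


def check_parallel(src_path, tgt_path):
    """Return the shared segment count, or raise if the sides differ."""
    n_src = count_segments(src_path)
    n_tgt = count_segments(tgt_path)
    if n_src != n_tgt:
        raise ValueError(f"segment count mismatch: {n_src} vs {n_tgt}")
    return n_src
```

Calling `check_parallel("train.de.gz", "train.en.xz")` (file names made up here) would return the segment count if the two sides line up and raise `ValueError` otherwise, which is the kind of check that is easy to forget in a one-off shell script.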

MTData is here to:

Automate machine translation training data creation by taking out human intervention. This is inspired by SacreBLEU, which takes out human intervention at the evaluation stage.
Provide a reusable tool instead of dozens of use-once shell scripts spread across multiple repos.

Source	Dataset Count
OPUS	156,257
Flores	51,714
Microsoft	8,128
Leipzig	5,893
Neulab	4,455
Statmt	1,798
Facebook	1,617
AllenAi	1,611
ELRC	1,575
EU	1,178
Tilde	519
LinguaTools	253
Anuvaad	196
AI4Bharath	192
ParaCrawl	127
Lindat	56
Google	55
UN	30
JoshuaDec	29
StanfordNLP	15
ParIce	8
LangUk	5
KECL	4
Phontron	4
NRC_CA	4
IITB	3
WAT	3
Masakhane	2
Total	235,731

pierotofy · August 21, 2025, 6:12am

Cool! Couldn’t help but notice: doesn’t OPUS already include certain datasets like ELRC, EU, Tilde, Paracrawl? Did they double-count by accident? Or maybe it’s just the way they list the dataset information.

ArtanisTheOne · August 22, 2025, 3:50pm

They’re probably just counting possible sources, so yeah, some of these sources are already within OPUS itself, which is a collection of everything.