Thammegowda/mtdata: A tool that locates, downloads, and extracts machine translation corpora

MTData


MTData automates the collection and preparation of machine translation (MT) datasets. It provides a CLI and Python APIs that can be used to prepare MT experiments.

This tool knows:

  • Where to download datasets from: WMT News Translation test and dev sets, Paracrawl, Europarl, News Commentary, WikiTitles, Tilde Model corpus, OPUS …
  • How to extract files: .tar, .tar.gz, .tgz, .zip, …
  • How to parse .tmx, .sgm and similar XML formats, or .tsv …, and check that both sides have the same number of segments.
  • Whether parallel data is in one .tsv file or two .sgm files.
  • Whether data is compressed with gz or xz, or not compressed at all.
  • Whether a pair is stored in source-target order or swapped to target-source order.
  • How to map language codes to ISO codes, using ISO 639-3, which has room for the 7000+ languages of our planet.
    • New in v0.3: BCP-47 like language ID: (language, script, region)
  • To download each file only once and keep it in a local cache.
  • (And more such small details, added over time.)
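
To make two of those bullets concrete (compression handling and the segment-count check), here is a rough Python sketch. It is not MTData's actual code: the helper names `open_maybe_compressed`, `count_segments` and `check_parallel` are made up for illustration, and it assumes one segment per line and that compression can be inferred from the file extension.

```python
import gzip
import lzma
from pathlib import Path


def open_maybe_compressed(path, mode="rt", encoding="utf-8"):
    """Open a text file that may be gzip- or xz-compressed (decided by extension)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".gz":
        return gzip.open(path, mode, encoding=encoding)
    if suffix == ".xz":
        return lzma.open(path, mode, encoding=encoding)
    # No known compression suffix: treat it as plain text.
    return open(path, mode, encoding=encoding)


def count_segments(path):
    """Count lines in a possibly compressed file (assumes one segment per line)."""
    with open_maybe_compressed(path) as f:
        return sum(1 for _ in f)


def check_parallel(src_path, tgt_path):
    """Return the shared segment count, or raise ValueError if the sides differ."""
    n_src = count_segments(src_path)
    n_tgt = count_segments(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            f"segment mismatch: {src_path} has {n_src}, {tgt_path} has {n_tgt}"
        )
    return n_src
```

Calling `check_parallel("train.de.gz", "train.en.xz")` (hypothetical file names) would return the number of segments if both sides line up, and raise otherwise. A real tool has to deal with many more formats, which is exactly the kind of detail the list above is about.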

MTData is here to:

  • Automate the creation of machine translation training data by taking out human intervention. This is inspired by SacreBLEU, which takes out human intervention at the evaluation stage.
  • Provide a reusable tool in place of dozens of use-once shell scripts spread across multiple repos.
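
| Source | Dataset Count |
|---|---|
| OPUS | 156,257 |
| Flores | 51,714 |
| Microsoft | 8,128 |
| Leipzig | 5,893 |
| Neulab | 4,455 |
| Statmt | 1,798 |
| Facebook | 1,617 |
| AllenAi | 1,611 |
| ELRC | 1,575 |
| EU | 1,178 |
| Tilde | 519 |
| LinguaTools | 253 |
| Anuvaad | 196 |
| AI4Bharath | 192 |
| ParaCrawl | 127 |
| Lindat | 56 |
| Google | 55 |
| UN | 30 |
| JoshuaDec | 29 |
| StanfordNLP | 15 |
| ParIce | 8 |
| LangUk | 5 |
| KECL | 4 |
| Phontron | 4 |
| NRC_CA | 4 |
| IITB | 3 |
| WAT | 3 |
| Masakhane | 2 |
| Total | 235,731 |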

Cool! Couldn’t help but notice: doesn’t OPUS already include some of these datasets, like ELRC, EU, Tilde, and Paracrawl? Did they double-count by accident? Or maybe it’s just the way they list the dataset information.

They’re probably just counting datasets per source, so yeah, some of these sources also show up inside OPUS, which is itself a collection of many corpora.
