# MTData
MTData automates the collection and preparation of machine translation (MT) datasets. It provides a CLI and a Python API, which can be used for preparing MT experiments.
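A typical workflow, sketched below, is to list what is available for a language pair and then fetch chosen corpora. The corpus IDs are placeholders; run `mtdata list` to see the real ones, and `mtdata --help` for the full set of options in your installed version.

```
# Install the tool
pip install mtdata

# List corpora available for a language pair
mtdata list -l deu-eng

# Download chosen train/test corpora into ./data
mtdata get -l deu-eng --train <corpus-id> --test <corpus-id> --out data
```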
This tool knows:
- Where to download datasets from: WMT News Translation test and dev sets, ParaCrawl, Europarl, News Commentary, WikiTitles, the Tilde Model corpus, OPUS, …
- How to extract files: .tar, .tar.gz, .tgz, .zip, …
- How to parse .tmx, .sgm, and similar XML formats, as well as .tsv files, and check that parallel files have the same number of segments.
- Whether parallel data comes as one .tsv file or two .sgm files.
- Whether data is compressed with gz, xz, or not compressed at all.
- Whether segments are in source-target order or swapped to target-source order.
- How to map codes to ISO language codes, using ISO 639-3, which has room for 7,000+ of our planet's languages.
- New in v0.3: BCP-47-like language IDs: (language, script, region).
- Download each file only once and keep it in a local cache.
- (And more such small details, handled over time.)
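The (language, script, region) scheme mentioned above can be illustrated with a small parser. The tag format and the `parse_tag` helper below are an illustrative sketch, not MTData's actual API: it assumes an ISO 639-3 language code, an optional ISO 15924 script code, and an optional two-letter region, joined by underscores.

```python
import re

# Illustrative BCP-47-like tag: 3-letter lowercase language (ISO 639-3),
# optional 4-letter Titlecase script (ISO 15924), optional 2-letter
# uppercase region, e.g. "deu", "deu_Latn", "deu_Latn_DE".
TAG = re.compile(
    r"^(?P<lang>[a-z]{3})"
    r"(?:_(?P<script>[A-Z][a-z]{3}))?"
    r"(?:_(?P<region>[A-Z]{2}))?$"
)

def parse_tag(tag: str):
    """Split a tag into (language, script, region); raise on malformed tags."""
    m = TAG.match(tag)
    if not m:
        raise ValueError(f"malformed language tag: {tag!r}")
    return m.group("lang"), m.group("script"), m.group("region")

print(parse_tag("deu"))          # ('deu', None, None)
print(parse_tag("hin_Deva_IN"))  # ('hin', 'Deva', 'IN')
```

Keeping script and region optional lets a bare ISO 639-3 code remain valid while still allowing finer distinctions where a corpus needs them.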
MTData is here to:
- Automate the creation of machine translation training data by taking out human intervention. This is inspired by SacreBLEU, which takes out human intervention at the evaluation stage.
- Be a reusable tool instead of dozens of use-once shell scripts spread across multiple repos.
Source | Dataset Count |
---|---|
OPUS | 156,257 |
Flores | 51,714 |
Microsoft | 8,128 |
Leipzig | 5,893 |
Neulab | 4,455 |
Statmt | 1,798 |
 | 1,617 |
AllenAi | 1,611 |
ELRC | 1,575 |
EU | 1,178 |
Tilde | 519 |
LinguaTools | 253 |
Anuvaad | 196 |
AI4Bharath | 192 |
ParaCrawl | 127 |
Lindat | 56 |
 | 55 |
UN | 30 |
JoshuaDec | 29 |
StanfordNLP | 15 |
ParIce | 8 |
LangUk | 5 |
KECL | 4 |
Phontron | 4 |
NRC_CA | 4 |
IITB | 3 |
WAT | 3 |
Masakhane | 2 |
Total | 235,731 |