# MTData
MTData automates the collection and preparation of machine translation (MT) datasets. It provides CLI and Python APIs that can be used to prepare MT experiments.
This tool knows:
- From where to download data sets: WMT News Translation tests and devs, Paracrawl, Europarl, News Commentary, WikiTitles, the Tilde Model corpus, OPUS, …
- How to extract files: .tar, .tar.gz, .tgz, .zip, …
- How to parse XML formats such as .tmx and .sgm, as well as .tsv files, and to check that parallel files have the same number of segments.
- Whether parallel data comes as a single .tsv file or as two .sgm files.
- Whether data is compressed with gz, xz, or not compressed at all.
- Whether segments are stored in source-target order or swapped as target-source.
- How to map language codes to ISO 639-3, which has room for 7000+ of our planet's languages.
- New in v0.3: BCP-47-like language IDs: (language, script, region).
- Download only once, and keep files in a local cache.
- (And more such tiny details, added over time.)
MTData is here to:
- Automate the creation of machine translation training data by removing human intervention. This is inspired by SacreBLEU, which removed human intervention at the evaluation stage.
- Be a reusable tool instead of dozens of use-once shell scripts spread across multiple repos.
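As a minimal sketch of typical usage (the language pair, output directory, and `<dataset-id>` placeholder here are illustrative; consult `mtdata -h` for the flags supported by your installed version):

```shell
# Install from PyPI
pip install mtdata

# List the datasets indexed for a language pair (ISO 639-3 codes)
mtdata list -l deu-eng

# Download and prepare a dataset into a local directory;
# downloads are cached, so repeated runs do not re-fetch files
mtdata get -l deu-eng --train <dataset-id> --out deu-eng-data
```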
The current dataset index covers the following sources:

| Source | Dataset Count |
|---|---|
| OPUS | 156,257 |
| Flores | 51,714 |
| Microsoft | 8,128 |
| Leipzig | 5,893 |
| Neulab | 4,455 |
| Statmt | 1,798 |
| | 1,617 |
| AllenAi | 1,611 |
| ELRC | 1,575 |
| EU | 1,178 |
| Tilde | 519 |
| LinguaTools | 253 |
| Anuvaad | 196 |
| AI4Bharath | 192 |
| ParaCrawl | 127 |
| Lindat | 56 |
| | 55 |
| UN | 30 |
| JoshuaDec | 29 |
| StanfordNLP | 15 |
| ParIce | 8 |
| LangUk | 5 |
| KECL | 4 |
| Phontron | 4 |
| NRC_CA | 4 |
| IITB | 3 |
| WAT | 3 |
| Masakhane | 2 |
| Total | 235,731 |