A thread to catalog publicly available datasets for machine translation
Opus
Opus is the primary data source for Argos Translate. It organizes many different machine translation datasets and is searchable by language pair.
Wikiextract
Wikiextract has Wiktionary data available for download. Some Wikiextract data was used to train Argos Translate models, but Argos Translate has chronically underperformed at translating single words, which makes dictionary data especially helpful.
(Run by the guy who invented ssh)
The Pile
The Pile is a very large corpus of language data released by EleutherAI, whose GPT-NeoX-20B is, to my knowledge, the largest open source language model currently available.
Argos Translate does not currently use Pile data, but it could be used to train few-shot translation models.
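To make "few-shot translation" a bit more concrete: one common reading is that a general language model is shown a handful of example translation pairs in its prompt and asked to continue the pattern. Here is a minimal sketch of how such a prompt might be assembled. The function name `build_few_shot_prompt`, the prompt layout, and the example pairs are all hypothetical, made up for illustration. This is not something Argos Translate does today.

```python
def build_few_shot_prompt(examples, source_text,
                          source_lang="English", target_lang="Spanish"):
    """Format example translation pairs plus a new sentence as one prompt.

    `examples` is a list of (source, target) string pairs. The layout
    ("Language: text" lines) is an assumed convention, not a standard.
    """
    lines = []
    for src, tgt in examples:
        lines.append(f"{source_lang}: {src}")
        lines.append(f"{target_lang}: {tgt}")
        lines.append("")  # blank line between examples
    # The final pair is left open so the model fills in the translation.
    lines.append(f"{source_lang}: {source_text}")
    lines.append(f"{target_lang}:")
    return "\n".join(lines)


# Hypothetical example pairs, purely for illustration.
examples = [("Hello", "Hola"), ("Thank you", "Gracias")]
prompt = build_few_shot_prompt(examples, "Good morning")
print(prompt)
```

The resulting string would be passed to whatever language model is being used, and the text it generates after the last `Spanish:` would be taken as the translation.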
Librivox S2S
Automatically mined Speech-to-Speech translations
The Stack
A large dataset of permissively licensed code.