A thread to catalog publicly available datasets for machine translation
Opus is the primary data source for Argos Translate. It organizes many different machine translation datasets and is searchable by language pair.
Wikiextract has Wiktionary data available for download. Some Wikiextract data was used for training Argos Translate models, but Argos Translate has chronically underperformed at translating single words, which makes dictionary data especially helpful.
(Run by the guy who invented
Argos Translate currently does not use Pile data, but it could be used for training few-shot translation models.
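For anyone unfamiliar with the few-shot idea, here is a rough sketch of how a few-shot translation prompt might be assembled from a handful of parallel sentence pairs. The function name `build_few_shot_prompt` and the prompt layout are hypothetical, chosen for illustration only; this is not how Argos Translate works today.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    source_text: str,
    source_lang: str = "English",
    target_lang: str = "Spanish",
) -> str:
    """Assemble a few-shot translation prompt (hypothetical format).

    Each (source, target) pair becomes a worked example, and the text to
    translate is appended last with an empty target slot for the model
    to complete.
    """
    lines = []
    for src, tgt in examples:
        lines.append(f"{source_lang}: {src}")
        lines.append(f"{target_lang}: {tgt}")
        lines.append("")  # blank line between examples
    lines.append(f"{source_lang}: {source_text}")
    lines.append(f"{target_lang}:")
    return "\n".join(lines)


if __name__ == "__main__":
    pairs = [("Hello", "Hola"), ("Good night", "Buenas noches")]
    print(build_few_shot_prompt(pairs, "Thank you"))
```

The example pairs would come from a parallel corpus, and the resulting string would be passed to a language model that fills in the final line.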
Automatically mined Speech-to-Speech translations
A large dataset of permissively licensed code.