Public Datasets

A thread to catalog publicly available datasets for machine translation

Opus is the primary data source for Argos Translate. It organizes many different machine translation datasets and is searchable by language pair.


Wikiextract has Wiktionary data available for download. Some Wikiextract data was used for training Argos Translate models, but Argos Translate has chronically underperformed at translating single words, which makes dictionary data especially helpful.

(Run by the guy who invented ssh)

The Pile

The Pile is a very large corpus of language data released by EleutherAI, whose GPT-NeoX-20B is, to my knowledge, the largest open source language model currently available.

Argos Translate does not currently use Pile data, but it could be used for training few-shot translation models.
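
For anyone unfamiliar with the few-shot setup, here is a rough sketch of what a few-shot translation prompt might look like. The function name, the language labels, and the prompt layout are all my own assumptions for illustration. Nothing here is prescribed by Argos Translate or the Pile.

```python
def build_few_shot_prompt(examples, source_text, src_lang="English", tgt_lang="Spanish"):
    """Format example translation pairs as a few-shot prompt (hypothetical layout)."""
    lines = []
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
        lines.append("")  # blank line between worked examples
    # The final block leaves the target side empty for the model to complete.
    lines.append(f"{src_lang}: {source_text}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


# Made-up example pairs, only to show the shape of the prompt.
examples = [("Hello", "Hola"), ("Thank you", "Gracias")]
prompt = build_few_shot_prompt(examples, "Good morning")
print(prompt)
```

A language model trained on a large general corpus would then be asked to continue the text after the last target label, so the example pairs act as the only "training" it sees for the task.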

Librivox S2S

Automatically mined Speech-to-Speech translations

The Stack

A large dataset of permissively licensed code.