Public Datasets

argosopentech · April 25, 2022, 11:38pm

A thread to catalog publicly available datasets for machine translation

argosopentech · April 25, 2022, 11:41pm

Opus

Opus is the primary data source for Argos Translate. It organizes many different machine translation datasets and is searchable by language pair.

argosopentech · April 25, 2022, 11:48pm

Wikiextract

Wikiextract has Wiktionary data available for download. Some Wikiextract data was used for training Argos Translate models. Translating single words with Argos Translate has chronically underperformed, which makes dictionary data especially helpful.

(Run by the guy who invented ssh)
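A rough sketch of how dictionary data might back up single-word translation: build a lookup table from dictionary entries and consult it only when the input is a single word. The JSON Lines layout below (a `word` field plus a `translations` list with `code` and `word` keys) is an assumption for illustration, and `translate_sentence` is a hypothetical stand-in for the model call, not Argos Translate's actual API.

```python
import json

# Hypothetical sample of dictionary data, one JSON object per line.
# The field names are assumptions for illustration only.
SAMPLE_JSONL = """\
{"word": "dog", "translations": [{"code": "es", "word": "perro"}, {"code": "fr", "word": "chien"}]}
{"word": "cat", "translations": [{"code": "es", "word": "gato"}]}
"""


def build_lookup(jsonl_text, target_code):
    """Map each source word to its first translation in target_code."""
    lookup = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        for tr in entry.get("translations", []):
            if tr.get("code") == target_code and tr.get("word"):
                # Keep the first translation seen for each source word.
                lookup.setdefault(entry["word"].lower(), tr["word"])
                break
    return lookup


def translate_sentence(text):
    """Hypothetical stand-in for a call to a translation model."""
    return f"<model translation of {text!r}>"


def translate(text, lookup):
    """Use the dictionary for single words, the model for everything else."""
    words = text.split()
    if len(words) == 1 and words[0].lower() in lookup:
        return lookup[words[0].lower()]
    return translate_sentence(text)


lookup_es = build_lookup(SAMPLE_JSONL, "es")
print(translate("Dog", lookup_es))
print(translate("the dog sleeps", lookup_es))
```

Falling back to the model for anything longer than one word keeps the dictionary from overriding context-dependent translations.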

argosopentech · April 25, 2022, 11:55pm

The Pile

The Pile is a very large corpus of language data released by EleutherAI, whose GPT-NeoX-20B is, to my knowledge, the largest open source language model currently available.

Argos Translate currently does not use Pile data, but it could be used for training few-shot translation models.
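For anyone unfamiliar with the idea, here is a minimal sketch of one way a few-shot translation prompt could be laid out for a general language model. The function name `build_few_shot_prompt` and the "Language: text" layout are assumptions for illustration. Nothing here is taken from Argos Translate or from any particular model's documentation.

```python
def build_few_shot_prompt(examples, source_lang, target_lang, text):
    """Format example translation pairs followed by the text to translate."""
    lines = []
    for src, tgt in examples:
        lines.append(f"{source_lang}: {src}")
        lines.append(f"{target_lang}: {tgt}")
        lines.append("")  # blank line between examples
    # Leave the final target line open for the model to complete.
    lines.append(f"{source_lang}: {text}")
    lines.append(f"{target_lang}:")
    return "\n".join(lines)


examples = [("Hello", "Hola"), ("Thank you", "Gracias")]
prompt = build_few_shot_prompt(examples, "English", "Spanish", "Good morning")
print(prompt)
```

The model would then be asked to continue the text after the final "Spanish:" line.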

argosopentech · May 5, 2022, 11:09pm

Librivox S2S

Automatically mined Speech-to-Speech translations

argosopentech · November 2, 2022, 10:24pm

The Stack

A large dataset of permissively licensed code.