Public Datasets

The Pile

The Pile is a large corpus of mostly English text (roughly 825 GiB) released by EleutherAI, whose GPT-NeoX-20B is, to my knowledge, the largest open-source language model currently available.

Argos Translate does not currently use Pile data, but the corpus could be used to train few-shot translation models.
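
For readers unfamiliar with the term, few-shot translation usually means showing a language model a handful of example translation pairs and then asking it to continue the pattern. The sketch below only illustrates how such a prompt might be laid out. The function name build_few_shot_prompt, the "Language: text" layout, and the example sentences are all assumptions made for illustration; none of this is part of Argos Translate or the Pile.

```python
def build_few_shot_prompt(examples, source_text,
                          source_lang="English", target_lang="Spanish"):
    """Format example translation pairs, then the text to translate.

    Hypothetical helper for illustration only; the prompt layout is an
    assumption, not a format required by any particular model.
    """
    lines = []
    for src, tgt in examples:
        lines.append(f"{source_lang}: {src}")
        lines.append(f"{target_lang}: {tgt}")
        lines.append("")  # blank line between example pairs
    # The final pair is left open so the model completes the translation.
    lines.append(f"{source_lang}: {source_text}")
    lines.append(f"{target_lang}:")
    return "\n".join(lines)


# Two made-up example pairs followed by the sentence to translate.
examples = [("Hello.", "Hola."), ("Thank you.", "Gracias.")]
prompt = build_few_shot_prompt(examples, "Good morning.")
print(prompt)
```

The resulting string would be passed to a language model as its input, and whatever the model generates after the final "Spanish:" would be taken as the translation.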