The Pile Dataset

The Pile is an 825 GiB diverse, open-source language modelling dataset made up of 22 smaller, high-quality datasets combined.

I think this dataset by EleutherAI is probably the highest-quality large text dataset publicly available.


Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets – both existing and newly constructed – many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.

Woah, this is great.


It really is; I think it’s improved since I last looked at it. Reading the paper, it looks like they have pretty high-quality data (Hacker News, Project Gutenberg, Wikipedia, etc.) and a lot of it. They’re also thoughtful about copyright: they’re practical, but try to use data that the authors intended to be open.

It’s more text data than I can currently use, so we shouldn’t have any shortage for the foreseeable future. It’s probably very English-centric, but it should also have a lot of multilingual data. Translation data, especially for smaller languages, is still very valuable though.

I’ve done some experiments training models on unstructured text data by splitting sentences in half and “translating” the first half to recover the second.

For example:

{"q":"I baked a cake ", "source":"auto", "target":"infer"}

↓

{"translatedText": "for my friend's birthday party."}

I haven’t had very good results, but I think this would work well with more powerful models. This is similar to the AlexaTM model, which used a translation-style encoder-decoder architecture for generating text.
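
For anyone curious, here’s a minimal sketch of how pairs like the one above could be generated from raw text. The naive sentence splitting and the midpoint cut are just illustrative assumptions, not exactly what I ran; only the field names ("q", "source", "target", "translatedText") come from the example above.

```python
import json

def make_pairs(text, min_words=6):
    """Turn raw text into translation-style (first half -> second half) pairs.

    The sentence splitting and midpoint heuristic are illustrative choices;
    a real pipeline would likely use a proper sentence tokenizer.
    """
    pairs = []
    # Naive sentence split on ". "; good enough for a sketch.
    for sentence in text.replace("\n", " ").split(". "):
        words = sentence.split()
        if len(words) < min_words:
            continue
        cut = len(words) // 2  # split roughly in the middle
        first = " ".join(words[:cut])
        second = " ".join(words[cut:])
        pairs.append({
            "q": first + " ",          # prompt: first half of the sentence
            "source": "auto",
            "target": "infer",
            "translatedText": second,  # label: second half to recover
        })
    return pairs

if __name__ == "__main__":
    sample = "I baked a cake for my friend's birthday party. The weather was lovely."
    for pair in make_pairs(sample):
        print(json.dumps(pair))
```

Running this on the sample sentence reproduces roughly the pair shown above; short sentences are skipped so the model always has some context to condition on.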
