The Stack - a 3TB dataset of permissively licensed source code

https://www.servicenow.com/blogs/2022/big-code-collaboration-introduces-the-stack.html

Dataset collection took several months to complete. To create The Stack, 220.92 million unique GitHub repository names were collected from GH Archive, and 51.76 billion files (5.28 billion of them unique) were successfully downloaded from 137.36 million public and accessible repositories. The uncompressed size of all stored files is 92.36TB.

The Stack’s dataset was filtered to include only code under permissive licenses, i.e., those with minimal restrictions on how the software can be copied, modified, and redistributed (e.g., MIT and Apache 2.0). Copyleft licenses such as GPL are not included because they require that the same rights be preserved in derivative works. Some have argued that a model trained on copyleft-licensed code could be considered a derivative work.
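The license filter described above can be sketched as a simple allow-list check. This is a minimal illustration, not the pipeline actually used for The Stack: the `PERMISSIVE_LICENSES` set, the record layout, and the function names are hypothetical, and dropping repositories with no detected license is an assumption of the sketch rather than something the source states.

```python
# Hypothetical allow-list for illustration; the real list of permissive
# licenses used for The Stack is not given in the text above.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}


def is_permissive(license_id):
    """Return True if the license identifier is on the allow-list."""
    if license_id is None:
        # Assumption of this sketch: unknown licenses are excluded.
        return False
    return license_id.strip().lower() in PERMISSIVE_LICENSES


def filter_permissive(repos):
    """Keep only repositories whose detected license is permissive."""
    return [repo for repo in repos if is_permissive(repo.get("license"))]


# Hypothetical repository records.
repos = [
    {"name": "a/tool", "license": "MIT"},
    {"name": "b/lib", "license": "GPL-3.0"},     # copyleft, dropped
    {"name": "c/app", "license": "Apache-2.0"},
    {"name": "d/misc", "license": None},         # no license detected, dropped
]
kept = filter_permissive(repos)
```

An allow-list is used rather than a block-list of copyleft licenses so that any license not explicitly recognized as permissive is excluded by default.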

When building software, snippets of code are often copied or configuration files reused with slightly altered settings. This leads to exact duplicates, as well as near duplicates, where two files are the same except for a few changes. Since training on duplicated data has a negative impact on model performance, these duplicates are removed from the dataset.
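The two kinds of duplicates can be illustrated with a small sketch: exact duplicates are caught by hashing file contents, and near duplicates by comparing sets of token shingles with Jaccard similarity. The text above does not say which method The Stack uses, so everything here (function names, the shingle size `k`, the `threshold` value) is an assumption chosen for illustration. The pairwise comparison is quadratic; large-scale pipelines commonly approximate it with MinHash and locality-sensitive hashing instead.

```python
import hashlib


def content_hash(text):
    """SHA-256 of the file contents, used to detect exact duplicates."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of k-token windows over whitespace-separated tokens."""
    tokens = text.split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(files, threshold=0.5, k=3):
    """Drop exact duplicates, then files too similar to one already kept."""
    seen_hashes = set()
    kept = []  # list of (text, shingle set) pairs
    for text in files:
        digest = content_hash(text)
        if digest in seen_hashes:
            continue  # exact duplicate
        seen_hashes.add(digest)
        current = shingles(text, k)
        if any(jaccard(current, previous) >= threshold for _, previous in kept):
            continue  # near duplicate
        kept.append((text, current))
    return [text for text, _ in kept]


# A config file, an exact copy, a copy with one setting changed,
# and an unrelated file.
original = "host = localhost\nport = 8080\ndebug = true\nworkers = 4\ntimeout = 30\n"
tweaked = original.replace("8080", "9090")
unrelated = "def add(a, b):\n    return a + b\n"

unique = deduplicate([original, original, tweaked, unrelated])
```

Here the exact copy is removed by the hash check and the tweaked config by the similarity check, leaving the original config and the unrelated file.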

For the permissive-license dataset, the four largest languages by size are HTML (746GB), JavaScript (486GB), Java (271GB), and C (222GB). These comprise more than 55% of the total dataset size.