As part of the BigCode project, we released and will maintain The Stack, a 6.4 TB dataset of permissively licensed source code in 358 programming languages, along with a collection of datasets created over the course of research during the project. Release v1.0: initial release of The Stack, covering 30 programming languages and 18 permissive licenses. Note: three of the included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset ...| BigCode
We’re on a journey to advance and democratize artificial intelligence through open source and open science.| huggingface.co
September 26, 2022: Announcement of the BigCode project. October 6, 2022: Webinar with the BigCode Community to provide strategic direction. October 27, 2022: Introduction of “The Stack” dataset and paper publication. November 15, 2022: Introduction of “Am I in The Stack” tool and BigCode Opt-Out process. November 23, 2022: Details shared on the approach to de-identification of personally identifiable information (PII). November 29, 2022: Sharing of Weights and Biases dashboards ...| BigCode