As part of the BigCode project, we released and will maintain The Stack, a 6.4 TB dataset of permissively licensed source code in 358 programming languages, along with a collection of datasets created in the course of research during the project.
Release v1.0: Initial release of The Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset ...| BigCode
We believe the soul of BigCode to be clear and transparent communication striving towards open collaboration. The project, therefore, runs under the following set of open and permissive licenses. Datasets. We value openness and transparency about the training data of LLMs and intend to release datasets whenever we have the rights to do so. We will also provide data cards for all datasets we release. Please see the Dataset Card for The Stack.
We are excited to invite AI practitioners from diverse backgrounds to join the BigCode project! Note that BigCode is a research collaboration and is open to participants who have a professional research background and are able to commit time to the project. In general, we expect applicants to be affiliated with a research organization (either in academia or industry) and work on the technical/ethical/legal aspects of LLMs for coding applications.
BigCode is an open scientific collaboration working on the responsible development and use of large language models for code (Code LLMs), empowering the machine learning and open source communities through open governance. One of the challenges typically faced by researchers working on Code LLMs is the lack of transparency around the development of these systems. While a handful of papers on Code LLMs have been published, they do not always give full insight into the development process, whic...
11/15 The Verge: The scary truth about AI copyright is nobody knows what will happen next
11/4 VentureBeat: How Hugging Face and ServiceNow tackle code-generating LLM challenges
10/31 Import AI: Want to train a big code model AND not annoy developers? ‘The Stack’ might be the dataset for you
9/30 SD Times Open-Source Project of the Week: BigCode
9/26 TechCrunch: Hugging Face and ServiceNow launch BigCode, a project to open source code-generating AI systems
Sponsors
BigCode is a community project jointly led by Hugging Face and ServiceNow. Both organizations committed research, engineering, ethics, governance, and legal resources to ensure that the collaboration runs smoothly and makes progress towards the stated goals. ServiceNow Research and Hugging Face have made their respective compute clusters available for large-scale training of the BigCode models, and Hugging Face hosts the datasets, models, and related applications from the community...
StarCoder
Paper: A technical report about StarCoder.
GitHub: All you need to know about using or fine-tuning StarCoder.
StarCoder: StarCoderBase further trained on Python.
StarCoderBase: Trained on 80+ languages from The Stack.
StarCoder+: StarCoderBase further trained on English web data.
StarEncoder: Encoder model trained on The Stack.
StarPii: StarEncoder-based PII detector.
StarCoder Tools & Demos
StarCoder Playground: Write with StarCoder Models!
VSCode Extension: Code with StarCoder!...
September 26, 2022: Announcement of the BigCode project.
October 6, 2022: Webinar with the BigCode Community to provide strategic direction.
October 27, 2022: Introduction of “The Stack” dataset and paper publication.
November 15, 2022: Introduction of “Am I in The Stack” tool and BigCode Opt-Out process.
November 23, 2022: Details shared on the approach to de-identification of personally identifiable information (PII).
November 29, 2022: Sharing of Weights and Biases dashboards ...
Our Pledge
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation. We pledge to act and interact in ways that contribute to an open, welcom...