CodeCommons
Open, responsible, and transparent AI: Our shared goal
CodeCommons is an ambitious project to create the world’s most comprehensive digital commons for code

Building on the existing foundation of Software Heritage, the largest publicly available source code archive, CodeCommons aims to bring together in one place all the critical, qualified information needed to create smaller, better datasets for the next generation of AI tools.

At its core, the project prioritizes transparency and traceability, enabling model builders and users to respect creators' rights while promoting sovereign and sustainable AI.

Why CodeCommons?
Sustainability
Minimizing the environmental and economic costs associated with repeated data collection
Our vision
What we're building
  • The world’s largest source code commons: enriched with billions of files, historical development data, metadata, and contextual links to scientific literature.
  • A scalable, unified data platform: allowing quick selection and extraction of specific code subsets designed for advanced AI training, with clear tracking using Software Heritage Identifiers (SWHIDs).
  • Tools for ethical AI development: advanced tools to ensure compliance with copyright laws, assess code quality, and enhance AI reproducibility.
  • Sustainable infrastructure: partnering with GENCI supercomputers to enable large-scale training of next-generation models while reducing environmental impact.
  • Clear principles: to ensure the ethical use of the Software Heritage archive for AI training, any machine learning (ML) models trained on our archive must be made publicly available under an open license, along with the necessary documentation and tools. The specific training data must be clearly identified using SWHIDs, enabling the assessment of biases, verification of data inclusion, and attribution of generated code. Mechanisms need to be in place to allow legitimate authors to exclude their code from training sets.
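To make the idea of identifying training data with SWHIDs concrete, here is a minimal sketch in Python. It relies on the fact that a content SWHID (`swh:1:cnt:...`) is built from the Git-compatible blob hash of a file's bytes. The function names and the `manifest` set are hypothetical, introduced only for illustration; they are not part of any CodeCommons or Software Heritage tool.

```python
import hashlib
from typing import AbstractSet


def content_swhid(data: bytes) -> str:
    """Compute the content SWHID (object type "cnt") for raw file bytes."""
    # Git-compatible blob hash: SHA-1 over "blob <length>\0" followed by the bytes.
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def is_listed(data: bytes, manifest: AbstractSet[str]) -> bool:
    """Check whether a file appears in a (hypothetical) set of training-data SWHIDs."""
    return content_swhid(data) in manifest


if __name__ == "__main__":
    source = b"print('hello')\n"
    manifest = {content_swhid(source)}  # assumed: a published list of SWHIDs
    print(content_swhid(source))
    print(is_listed(source, manifest))
    print(is_listed(b"something else\n", manifest))
```

Because the identifier depends only on the file's content, anyone holding a copy of a file can recompute its SWHID and compare it with a published list, without needing access to the training pipeline itself.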
In February 2024, the BigCode project released StarCoder2, a state-of-the-art open AI model trained using the GitHub repositories archived in Software Heritage. This noteworthy release is proof that it's possible to develop high-quality models while adhering to rigorous principles of transparency and openness.
Shaping the future of AI

CodeCommons isn't just a project; it's a movement towards an ethical, transparent, and accessible AI future. Together, we're laying the groundwork for the next generation of AI.

Join us

Join our community and help shape the future of AI: Sign up for our mailing list to stay informed and connected.

Resources
Contact
Questions about CodeCommons?
Partners and teams
Key partners
Additional partners
Supported by