The landscape of artificial intelligence (AI) training is undergoing a significant shift as web domains increasingly restrict access to their data. A new report titled "Consent in Crisis: The Rapid Decline of the AI Data Commons," led by Shayne Longpre and a team of contributors, highlights the growing challenges AI developers face due to these restrictions. Based on a year-long audit conducted from 2023 to 2024, the study examines how web domains' consent preferences are evolving and what that means for AI training corpora.

The report's findings are based on an audit of 14,000 web domains, focusing on their policies regarding web crawlers and AI data usage. The study was driven by the need to understand the changing dynamics of web data accessibility and consent. The audit revealed a significant increase in AI-specific clauses limiting data use, as well as inconsistencies between websites' Terms of Service and their robots.txt files. This shift points to broader issues with existing web protocols, which were not designed to handle the massive repurposing of internet data for AI training.

Findings

The findings reveal a rapid increase in restrictions on AI data usage from web domains. Key points include:

  • In one year, around 5% of tokens from major AI training corpora like C4, RefinedWeb, and Dolma have become restricted due to changes in robots.txt files. Separately, 45% of C4 tokens are restricted by Terms of Service agreements.
  • Inconsistencies between robots.txt files and Terms of Service documents indicate inefficiencies in current web protocols for communicating consent.
  • The restrictions could bias AI training data, reducing its diversity and freshness.

Proliferation of Restrictions

The study's results indicate a rapid proliferation of restrictions on web crawlers used for AI development. In just one year, around 5% of tokens from major AI training corpora like C4, RefinedWeb, and Dolma have become restricted due to changes in robots.txt files. Among the most actively maintained sources, this figure rises to 28%. Additionally, 45% of C4 is now restricted by Terms of Service agreements. If these restrictions are respected or enforced, they could significantly reduce the diversity and freshness of training data and limit the scalability of general-purpose AI systems.
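To make the token-level figures concrete, here is a minimal sketch of how a token-weighted "restricted share" could be computed from robots.txt files. It is not the report's methodology: the domain names, token counts, and robots.txt contents are hypothetical, and the helper names `is_restricted` and `restricted_token_share` are invented for illustration. The only real API used is Python's standard `urllib.robotparser`, and GPTBot is the user agent of OpenAI's crawler.

```python
from urllib.robotparser import RobotFileParser


def is_restricted(robots_txt: str, user_agent: str, url: str = "/") -> bool:
    """Return True if the given robots.txt text disallows user_agent from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def restricted_token_share(domains: list[dict], user_agent: str) -> float:
    """Share of corpus tokens that come from domains blocking user_agent."""
    total = sum(d["tokens"] for d in domains)
    blocked = sum(
        d["tokens"] for d in domains if is_restricted(d["robots_txt"], user_agent)
    )
    return blocked / total if total else 0.0


# Hypothetical corpus: three domains with made-up token counts.
domains = [
    {"name": "news.example", "tokens": 500,
     "robots_txt": "User-agent: GPTBot\nDisallow: /\n"},
    {"name": "blog.example", "tokens": 300,
     "robots_txt": "User-agent: *\nAllow: /\n"},
    {"name": "wiki.example", "tokens": 200, "robots_txt": ""},  # no rules at all
]

share = restricted_token_share(domains, "GPTBot")
print(f"Restricted share for GPTBot: {share:.0%}")
```

Weighting by tokens rather than counting domains matters here: one large domain that blocks a crawler can remove far more training data than many small ones, which is why the report's head-of-distribution figures are higher than the corpus-wide ones.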

Inconsistencies in Consent Mechanisms

A critical finding of the report is the inconsistency between robots.txt files and Terms of Service. These inconsistencies reveal how poorly current web protocols communicate content creators' intentions. Restrictions are also applied unevenly across developers: for example, OpenAI's crawlers face more restrictions than those of other AI developers. These discrepancies highlight the need for more effective mechanisms to convey consent preferences and manage the impact on AI training data.
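The kind of mismatch described above can be illustrated with a small sketch that compares two signals per domain: what robots.txt says to a given crawler, and whether the Terms of Service prohibit AI use. This is an assumption-laden illustration, not the report's audit procedure. The `DomainPolicy` type, the `tos_prohibits_ai` label (imagined here as a manual annotation), the helper names, and all example domains are hypothetical.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass
class DomainPolicy:
    name: str
    robots_txt: str
    tos_prohibits_ai: bool  # hypothetical manual label from reading the ToS


def robots_blocks(robots_txt: str, user_agent: str) -> bool:
    """Return True if robots.txt disallows user_agent from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


def find_mismatches(policies: list[DomainPolicy], user_agent: str) -> list[tuple[str, str]]:
    """List (domain, kind) pairs where robots.txt and the ToS label disagree."""
    mismatches = []
    for p in policies:
        blocked = robots_blocks(p.robots_txt, user_agent)
        if p.tos_prohibits_ai and not blocked:
            mismatches.append((p.name, "tos_only"))     # ToS restricts, robots.txt does not
        elif blocked and not p.tos_prohibits_ai:
            mismatches.append((p.name, "robots_only"))  # robots.txt restricts, ToS does not
    return mismatches


# Hypothetical policies for four made-up domains.
policies = [
    DomainPolicy("news.example", "User-agent: *\nAllow: /\n", True),
    DomainPolicy("shop.example", "User-agent: GPTBot\nDisallow: /\n", False),
    DomainPolicy("forum.example", "User-agent: GPTBot\nDisallow: /\n", True),
    DomainPolicy("blog.example", "", False),
]

for name, kind in find_mismatches(policies, "GPTBot"):
    print(name, kind)
```

A "tos_only" result is the harder case in practice: a crawler that consults only robots.txt would never see the restriction, which is one way the protocol gap described in the report can arise.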

Impact on AI Training Data

The increasing restrictions are reshaping the landscape of AI training data. The head distribution of web domains—those contributing the most tokens to AI corpora—differs significantly from the long tail. These top sources, including news, encyclopedias, and social media sites, are more likely to have user-generated content, multimedia elements, and monetized content. As restrictions increase, AI training data may become less representative of the current web, skewing towards older and less diverse content.

Mismatch Between AI Uses and Training Data

Another notable aspect of the report is the mismatch between the types of web data used for AI training and the real-world applications of conversational AI. For instance, while a significant portion of training data comes from news websites, user interactions with AI systems like ChatGPT often involve creative writing, brainstorming, and general information requests—areas that are less represented in web-derived training data. This misalignment could affect the performance and alignment of AI models with user expectations.

Future Trends and Challenges

The report's forecasts suggest a continued decline in open web data sources. By April 2025, an additional 2-4% of C4, RefinedWeb, and Dolma tokens are expected to become restricted. This trend underscores the urgent need for better protocols to manage web data consent. Without such mechanisms, the availability of high-quality training data will diminish, challenging the scalability and capabilities of future AI models.

Conclusion

The rapid decline of the AI data commons presents significant challenges for AI development. As web domains increasingly restrict access to their data, the diversity and freshness of AI training data are at risk. The inconsistencies in consent mechanisms further complicate this issue, highlighting the need for improved protocols to manage web data usage. Addressing these challenges is crucial to ensuring the continued advancement and ethical use of AI technologies.