What is the AI scraping hypocrisy debate about?

AI labs built their models by scraping web content without permission, then began blocking scrapers and suing companies for doing the same to them. A 779-upvote Reddit thread on r/artificial called this out directly, and the complaint is backed by real lawsuits and new disclosure laws.

What is California AB 2013?

AB 2013 is a California law, effective January 1, 2026, that requires generative AI developers to publicly disclose a high-level summary of their training datasets across 12 required categories of information, including whether copyrighted material or personal information was used.

What happened in the Anthropic copyright settlement?

Anthropic agreed to pay $1.5 billion to settle Bartz v. Anthropic, covering roughly 500,000 books downloaded from pirated datasets LibGen and PiLiMi. The court gave preliminary approval in September 2025. Authors receive about $3,000 per covered book.

Should small teams care about AI vendor training data lawsuits?

Yes, as a vendor-risk signal. A vendor in active copyright litigation over training data can face court-ordered dataset changes, feature removal, or business disruption that affects your access to the tool, much as the Anthropic export ban caused a 17-day outage.

AI Companies Built Empires by Scraping the Internet. Now They're Suing Anyone Who Scrapes Them Back.

A Reddit thread with 779 upvotes asked a question that's been sitting under the surface of the AI industry for years: "So now scraping data without permission is bad for AI training all of a sudden?"

https://www.reddit.com/r/artificial/comments/1ugwccs/so_now_scraping_data_without_permission_is_bad/

It's a fair question, and it's not just a vibe. There's a paper trail.

TL;DR: AI labs built their products by scraping the internet without asking, then started blocking scrapers and suing anyone who did the same to them. This isn't just internet cynicism anymore. Anthropic paid $1.5 billion to settle a piracy lawsuit over training data. YouTube creators, including Ethan Klein, are actively suing Nvidia, Meta, ByteDance, Snap, and Apple over scraped video. California now legally requires AI developers to disclose their training data sources. For small teams, this is a vendor-risk signal worth checking, not just an internet argument.

The double standard, with receipts

Every major foundation model (GPT, Claude, Gemini, Llama) was built substantially on datasets assembled by crawling the open web, largely without asking the people who made that content. That was the industry's founding move, and for years it was treated as an engineering detail, not a legal exposure.

That posture is getting harder to hold. A few data points:

Anthropic paid $1.5 billion. Bartz v. Anthropic, a class action over roughly 500,000 books Anthropic downloaded from the pirated datasets LibGen and PiLiMi, received preliminary court approval for a $1.5 billion settlement in September 2025. That's not exactly a scraping case; it's a piracy case. But the underlying complaint is the same species: content taken without permission to train a commercial model.

YouTube creators are suing five companies at once. A group of YouTubers, Ethan Klein prominently among them, has filed related suits against Nvidia, Meta, ByteDance, Snap, and Apple, alleging their videos were scraped without consent to train AI systems. The Snap suit specifically calls out HD-VILA-100M, a dataset built for academic research and allegedly repurposed for commercial model training. The Apple suit, filed in April 2026, alleges Apple downloaded millions of YouTube videos to train a generative video model. The plaintiffs' legal strategy is notably aggressive: they're leaning on DMCA anti-circumvention claims specifically to avoid a fair-use defense fight.

And then there's the OpenAI-DeepSeek irony, which is the cleanest example of the double standard in action. When DeepSeek's models showed capability jumps that looked suspiciously close to OpenAI's own outputs, OpenAI and Microsoft opened an investigation into whether DeepSeek had gained "unauthorized" access to OpenAI's data. The industry noticed the word choice. A company built on scraped web content was now the aggrieved party in an unauthorized-access story.

None of this means every AI company is acting in bad faith, or that scraping is categorically indefensible. It means the industry spent a decade treating "we didn't ask" as a non-issue for its own conduct, and is now treating the same behavior as a threat when it's aimed back at them.

The regulatory response is already here

This is where it stops being just an internet argument and starts being something small teams need to track. California's AB 2013, the AI Training Data Transparency law, took effect January 1, 2026. It requires developers of generative AI systems to publicly post a high-level summary of their training datasets, across 12 required categories, including whether the data included copyrighted material and whether it included personal information.

That's the direct regulatory answer to the Reddit thread's complaint. If AI labs won't voluntarily explain what they scraped, California now requires a version of that disclosure by law, at least for systems made available to Californians.

For a small team, the AB 2013 disclosure is not just a compliance checkbox for AI developers; it's a due diligence tool for AI buyers. If a vendor's training data disclosure shows heavy reliance on datasets already under active litigation, or if the vendor has no disclosure posted at all, that's a data point about how exposed that vendor is to a future forced change.

What this actually means for your vendor risk

The lawsuits above are not abstract industry drama. They're a preview of the same governance risk this site has already documented once this year: when a vendor gets hit with a legal or regulatory action tied to how their model was built, the fallout can land on customers who had nothing to do with it. The Anthropic export ban took two of Claude's most capable models offline for 17 days over a national-security issue that had nothing to do with any customer's usage. A copyright ruling that forces a vendor to retrain, remove a feature, or restrict a dataset could do something similar.

What to actually do with this: when evaluating or renewing an AI vendor, check whether they've published an AB 2013-style training data disclosure, even if your team isn't in California, since most vendors publish once and apply it broadly. Also check whether the vendor is named in active training-data litigation. Neither finding is disqualifying on its own, since most major vendors have some litigation exposure now, but a vendor with no disclosure and active undisclosed lawsuits is a materially different risk than one that's transparent about both.
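If you want to make that check repeatable across renewals, the two questions above can be written down as a tiny checklist. What follows is a minimal sketch in Python, not a scoring standard: the `VendorCheck` fields, the `risk_level` function, and the three labels ("baseline", "review", "elevated") are all hypothetical names invented for illustration, and you fill in the answers yourself from the vendor's published disclosure and public court records.

```python
from dataclasses import dataclass


@dataclass
class VendorCheck:
    """Answers to the two due-diligence questions for one vendor (hypothetical schema)."""
    name: str
    has_training_data_disclosure: bool   # AB 2013-style summary is publicly posted
    named_in_active_litigation: bool     # named in training-data lawsuits
    litigation_disclosed: bool = True    # vendor is open about that litigation


def risk_level(v: VendorCheck) -> str:
    """Map the checklist to an illustrative label; the tiers are an assumption."""
    undisclosed_suits = v.named_in_active_litigation and not v.litigation_disclosed
    # No disclosure plus lawsuits the vendor is not transparent about:
    # the combination the article treats as materially different.
    if not v.has_training_data_disclosure and undisclosed_suits:
        return "elevated"
    # A single finding is not disqualifying, but is worth a closer look.
    if not v.has_training_data_disclosure or v.named_in_active_litigation:
        return "review"
    return "baseline"


# Placeholder vendors, not real companies.
vendors = [
    VendorCheck("Vendor A", True, False),
    VendorCheck("Vendor B", True, True),
    VendorCheck("Vendor C", False, True, litigation_disclosed=False),
]
for v in vendors:
    print(f"{v.name}: {risk_level(v)}")
```

The point of the sketch is the shape of the decision rather than the labels: one finding prompts a closer look, and only the combination of no disclosure and undisclosed litigation gets flagged as a different class of risk.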

Sources: NPR, "Anthropic pays authors $1.5 billion to settle copyright infringement lawsuit"; Copyright Alliance, Bartz v. Anthropic settlement overview; TechCrunch, "YouTubers sue Snap for alleged copyright infringement in training its AI models"; Bloomberg Law, "Nvidia Faces Class Action Over Scraping YouTube to Train AI"; Goodwin Law, California's AB 2013 takes effect; r/artificial Reddit discussion (June 27, 2026).
