A Reddit thread with 779 upvotes asked a question that's been sitting under the surface of the AI industry for years: "So now scraping data without permission is bad for AI training all of a sudden?"
https://www.reddit.com/r/artificial/comments/1ugwccs/so_now_scraping_data_without_permission_is_bad/
It's a fair question, and it's not just a vibe. There's a paper trail.
TL;DR: AI labs built their products by scraping the internet without asking, then started blocking scrapers and suing anyone who did the same to them. This isn't just internet cynicism anymore. Anthropic paid $1.5 billion to settle a piracy lawsuit over training data. YouTube creators, including Ethan Klein, are actively suing Nvidia, Meta, ByteDance, Snap, and Apple over scraped video. California now legally requires AI developers to disclose their training data sources. For small teams, this is a vendor-risk signal worth checking, not just an internet argument.
The double standard, with receipts
Every major foundation model (GPT, Claude, Gemini, Llama) was built substantially on datasets assembled by crawling the open web, largely without asking the people who made that content. That was the industry's founding move, and for years it was treated as an engineering detail, not a legal exposure.
That posture is getting harder to hold. A few data points:
Anthropic paid $1.5 billion. Bartz v. Anthropic, a class action over roughly 500,000 books Anthropic downloaded from the pirated datasets LibGen and PiLiMi, received preliminary court approval for a $1.5 billion settlement in September 2025. That's not exactly a scraping case; it's a piracy case. But the underlying complaint is the same species: content taken without permission to train a commercial model.
YouTube creators are suing five companies at once. A group of YouTubers, Ethan Klein prominently among them, has filed related suits against Nvidia, Meta, ByteDance, Snap, and Apple, alleging their videos were scraped without consent to train AI systems. The Snap suit specifically calls out HD-VILA-100M, a dataset built for academic research and allegedly repurposed for commercial model training. The Apple suit, filed in April 2026, alleges Apple downloaded millions of YouTube videos to train a generative video model. The plaintiffs' legal strategy is notably aggressive: they're leaning on DMCA anti-circumvention claims specifically to avoid a fair-use defense fight.
And then there's the OpenAI-DeepSeek irony, which is the cleanest example of the double standard in action. When DeepSeek's models showed capability jumps that looked suspiciously close to OpenAI's own outputs, OpenAI and Microsoft opened an investigation into whether DeepSeek had gained "unauthorized" access to OpenAI's data. The industry noticed the word choice. A company built on scraped web content was now the aggrieved party in an unauthorized-access story.
None of this means every AI company is acting in bad faith, or that scraping is categorically indefensible. It means the industry spent a decade treating "we didn't ask" as a non-issue for its own conduct, and is now treating the same behavior as a threat when it's aimed back at them.
The regulatory response is already here
This is where it stops being just an internet argument and starts being something small teams need to track. California's AB 2013, the AI Training Data Transparency law, took effect January 1, 2026. It requires developers of generative AI systems to publicly post a high-level summary of their training datasets, across 12 required categories, including whether the data included copyrighted material and whether it included personal information.
That's the direct regulatory answer to the Reddit thread's complaint. If AI labs won't voluntarily explain what they scraped, California now requires a version of that disclosure by law, at least for systems made available to Californians.
For a small team, the AB 2013 disclosure is not just a compliance checkbox for AI developers; it's also a due diligence tool for AI buyers. If a vendor's training data disclosure shows heavy reliance on datasets already under active litigation, or if the vendor has no disclosure posted at all, that's a data point about how exposed that vendor is to a future forced change.
What this actually means for your vendor risk
The lawsuits above are not abstract industry drama. They're a preview of the same governance risk this site has already documented once this year: when a vendor gets hit with a legal or regulatory action tied to how their model was built, the fallout can land on customers who had nothing to do with it. The Anthropic export ban took two of Claude's most capable models offline for 17 days over a national-security issue that had nothing to do with any customer's usage. A copyright ruling that forces a vendor to retrain, remove a feature, or restrict a dataset could do something similar.
What to actually do with this: when evaluating or renewing an AI vendor, check whether they've published an AB 2013-style training data disclosure, even if your team isn't in California, since most vendors publish once and apply it broadly. Then check whether the vendor is named in active training-data litigation. Neither finding is disqualifying on its own, since most major vendors have some litigation exposure now, but a vendor with no disclosure and active undisclosed lawsuits is a materially different risk than one that's transparent about it.
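If your team tracks vendors in a spreadsheet or script, those two checks can be written down as a small triage rule. The Python below is a minimal sketch under stated assumptions: the `VendorFinding` fields, the `triage` function, and the three tier names are all hypothetical, invented here for illustration. They are not part of AB 2013 or any standard, and the answers still have to come from a person reading the vendor's disclosure page and the court dockets.

```python
from dataclasses import dataclass


@dataclass
class VendorFinding:
    """Hypothetical record of what a reviewer found for one vendor."""
    name: str
    has_training_data_disclosure: bool       # an AB 2013-style summary is posted
    named_in_training_data_litigation: bool  # named in an active training-data suit
    acknowledges_litigation: bool = False    # vendor is open about that exposure


def triage(v: VendorFinding) -> str:
    """Return a rough review tier. The tiers are illustrative assumptions."""
    opaque_about_suits = (
        v.named_in_training_data_litigation and not v.acknowledges_litigation
    )
    # No disclosure plus lawsuits the vendor is not open about: the
    # "materially different risk" case described above.
    if not v.has_training_data_disclosure and opaque_about_suits:
        return "escalate"
    # A single finding is not disqualifying, but it is worth tracking.
    if not v.has_training_data_disclosure or v.named_in_training_data_litigation:
        return "monitor"
    return "baseline"


if __name__ == "__main__":
    # Placeholder vendors, not real companies.
    for vendor in [
        VendorFinding("ExampleVendorA", True, True, True),
        VendorFinding("ExampleVendorB", False, True, False),
        VendorFinding("ExampleVendorC", True, False),
    ]:
        print(f"{vendor.name}: {triage(vendor)}")
```

The point of writing it down is less the code than the consistency: every vendor gets asked the same two questions at every renewal, and the "escalate" tier is reserved for the combination of no disclosure and unacknowledged lawsuits rather than either finding alone.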
Related Reading
- AI Vendor Contract Red Flags: What to Check Before Signing
- AI Vendor Due Diligence Checklist for Small Teams
- Anthropic Built a Model That Secretly Sabotaged You: The Fable 5 Backlash
- Anthropic Export Ban: What the 17-Day Fable 5 Shutdown Means for Your AI Vendor Policy
- Does Your AI Vendor Train on Your Data? 11-Vendor Policy Comparison
Sources
- NPR, "Anthropic pays authors $1.5 billion to settle copyright infringement lawsuit"
- Copyright Alliance, Bartz v. Anthropic settlement overview
- TechCrunch, "YouTubers sue Snap for alleged copyright infringement in training its AI models"
- Bloomberg Law, "Nvidia Faces Class Action Over Scraping YouTube to Train AI"
- Goodwin Law, California's AB 2013 takes effect
- r/artificial Reddit discussion (June 27, 2026)
