Reddit Has Become the Best Source of Human Data on the Internet. AI Companies Are Trying to Steal It

Everyone wants Reddit’s data because it’s human data. That is what makes the platform a crown jewel for AI companies, which want to use its content to train their models. Reddit is fed up with companies trying to do so without permission or payment.

Reddit is suing Anthropic. The social media platform has filed a lawsuit against Anthropic, the maker of Claude, for breach of contract and for engaging in “unlawful and unfair commercial acts” by using Reddit’s platform and data without authorization. In other words, Reddit accuses Anthropic of stealing its data for AI development.

Strong criticism. Reddit’s legal team opens the lawsuit with harsh words: “Anthropic is a late-blooming artificial intelligence (‘AI’) company that bills itself as the white knight of the AI industry. It is anything but.” Reddit says Anthropic maintains a public image of legality and responsibility, while privately ignoring any rules that stand in the way of profit.

Reddit is a treasure trove of human data. Reddit has become a vital source of human knowledge. For answers, experience, and raw opinions, Reddit stands as the standard. The company knows its worth. Its legal director, Ben Lee, told The Verge:

“Reddit’s humanity is uniquely valuable in a world flattened by AI. Now more than ever, people are seeking authentic human-to-human conversation. Reddit hosts nearly 20 years of rich, human discussion on virtually every topic imaginable. These conversations don’t happen anywhere else—and they’re central to training language models like Claude.”

Reddit began protecting its content early. Recognizing the value of its “human data,” Reddit moved quickly to protect it. A few months after ChatGPT launched, Reddit capped its API—much like Tesla CEO Elon Musk had done with X/Twitter. The controversial move aimed to shield the platform from AI companies. Lawsuits followed.

The message was clear: If you want my data, pay. Some companies have taken notice. Google was among the first to strike a deal, agreeing to pay $60 million to access Reddit data for AI training. OpenAI also made an agreement, although the terms remain undisclosed.

Anthropic disagrees. In a statement to CNBC, Anthropic said, “We disagree with Reddit’s claims and will defend ourselves vigorously.” Interestingly, Anthropic has restricted Windsurf, an AI programming startup reportedly being acquired by OpenAI, from accessing Claude. One of Anthropic’s co-founders said, “I think it would be odd for us to be selling Claude to OpenAI.” That may be a reasonable argument in that case, but it’s a harder sell when applied to Reddit.

More lawsuits are piling up. Anthropic’s legal troubles aren’t limited to Reddit. In August 2024, three authors sued the company in California federal court, claiming it built a multi-million-dollar business by using hundreds of thousands of copyrighted books without permission. In October 2023, Universal Music sued Anthropic in Tennessee for “systematically and widely infringing” on song lyrics. That case ended in Anthropic’s favor, a troubling win for AI companies pushing copyright boundaries.

The looting of the internet continues. Reddit’s lawsuit is just another chapter in the growing story of AI companies scraping the internet. No site is truly safe. Some cases, like those involving Perplexity or Meta’s mass downloads of copyrighted books, stand out more than others. If data can make their models better, companies will go after it, and that’s exactly what’s happening with Reddit.

AI companies are ignoring copyright law. This trend reflects a disturbing reality: Even as they repeatedly infringe on copyrights, AI companies face little accountability. OpenAI has pushed for broad legal protections to operate freely, and others have joined the call to weaken copyright laws for AI training. “Fair use” remains their key legal defense. Yet time passes without consequences, and the internet continues to be stripped for parts.

Image | Brett Jordan (Unsplash)
