The ongoing Kadrey v. Meta Platforms, Inc. lawsuit accuses the tech giant of using copyrighted materials to train its artificial intelligence models. A few months ago, it was revealed that Meta CEO Mark Zuckerberg authorized the use of pirated books. New evidence recently emerged to support these claims.
Unsealed emails. Appendix A of the case includes several emails from Meta employees revealing large-scale downloads of copyrighted books. In October 2022, one employee, Melanie Kambadur, said she refused to participate in this kind of data collection.
“Torrenting from a corporate laptop doesn’t feel right,” Nikolay Bashlykov, a Meta engineer responsible for this data collection, said in an April 2023 message. He added that the company needed to be cautious about the IP address from which they downloaded the materials.
Meta knew the risks. In September 2023, Bashlykov cautioned that torrenting could lead to “seeding,” which “could be legally not OK.” These internal discussions suggest that Meta recognized this type of activity as unlawful, according to authors who have sued the company.
Covering its tracks. In an internal message, Meta researcher Frank Zhang said that the company took measures to avoid using its servers when downloading the data set. This was to prevent anyone from being able to trace the seeding and the entity downloading the content.
81.7 TB of data. According to Ars Technica, new evidence indicates that Meta downloaded at least 81.7 TB of data from several shadow libraries that offered copyrighted books via torrents. A recent document from the ongoing legal process revealed that at least 35.7 TB were downloaded from sites such as Z-Library and LibGen (which was shut down in the summer).
Meta seeks to dismiss the allegations. The company has filed a motion to dismiss these claims. Meta argues there’s no evidence that any of its employees downloaded books via torrents or that Meta distributed them. Xataka has contacted the company for comments on the case and will update this post if we receive a response.
Plundering the Internet. This issue highlights the questionable practices that AI companies employ to train their models. It happened with Google, which updated its privacy policy in 2023 to say that it’ll “use publicly available information to help train Google’s AI models.” It’s also evident with OpenAI, which used millions of texts, many of them copyrighted, to train ChatGPT. Perplexity recently came under scrutiny for bypassing the “rules of the Internet” to feed its AI model.
Internet theft is being normalized. What’s remarkable is that as companies increasingly skirt the rules and violate copyright, this behavior is starting to be seen as normal. Outrage fades quickly, and many treat the practice as an accepted cost of doing business.
Is this really “fair use”? Many companies rely on the concept of “fair use,” which allows for limited use of protected material without requiring permission. While copyright infringement lawsuits are emerging in the world of generative AI, they often seem to take a backseat as these large companies continue to thrive.
Image | Florencia Viadana
Related | What Are Distilled AI Models? A Look at LLM Distillation and Its Outputs