Court filings reveal Meta trained its AI with 81.7 TB of pirated content



Meta, which develops the large language model 'LLaMA,' was sued in July 2023 for 'training AI using copyrighted books.' In this lawsuit, evidence was presented that Meta trained LLaMA using approximately 81.7 TB of data obtained from pirated e-book libraries such as Z-Library and Anna's Archive.

Kadrey-v-Meta-Motion-for-Relief-Appendix-A-2-5-25.pdf
(PDF file) https://cdn.arstechnica.net/wp-content/uploads/2025/02/Kadrey-v-Meta-Motion-for-Relief-Appendix-A-2-5-25.pdf

“Torrenting from a corporate laptop doesn't feel right”: Meta unsealed - Ars Technica
https://arstechnica.com/tech-policy/2025/02/meta-torrented-over-81-7tb-of-pirated-books-to-train-ai-authors-say/



Meta Torrented over 81 TB of Data Through Anna's Archive, Despite Few Seeders - TorrentFreak
https://torrentfreak.com/meta-torrented-over-81-tb-of-data-through-annas-archive-despite-few-seeders-250206/

Comedian and author Sarah Silverman and authors Christopher Golden and Richard Kadrey sued OpenAI and Meta in July 2023, alleging that ChatGPT and LLaMA were trained on datasets that were illegally distributed on the Internet.

OpenAI and Meta sued by three authors for copyright infringement - GIGAZINE



In January 2025, a Meta employee admitted to removing copyright information from a dataset based on the pirate e-book library Library Genesis (LibGen), and internal company documents revealed that Meta had officially approved the use of LibGen.

Meta CEO Mark Zuckerberg is being pursued in a lawsuit for allowing the AI 'Llama' development team to use copyrighted works without permission - GIGAZINE



Furthermore, in February 2025, the plaintiffs argued, 'The scale of Meta's illegal AI training is astonishing. In spring 2024 alone, Meta obtained at least 81.7 TB of data from multiple pirated e-book libraries through a site called Anna's Archive, including at least 35.7 TB of data from Z-Library and LibGen.' The plaintiffs also pointed out that Meta had previously obtained 80.6 TB of data from LibGen.

Meta has consistently argued that 'AI training using LibGen is fair use.' However, emails (PDF file) revealed that Meta chose not to use Facebook's infrastructure to download the dataset, in order to avoid the risk of the downloads being traced back to Meta. The plaintiffs therefore argued that 'Meta knew that collecting data from pirated e-book libraries was illegal.'



Meta, on the other hand, has countered that 'Plaintiffs have not identified a single instance in which any of their Books was actually downloaded from Meta by a third party via a pirate e-book library, much less alleged that Plaintiffs' Books were in any way distributed by Meta,' and has asked for the plaintiffs' claims to be dismissed.

in Web Service, Posted by log1r_ut