Taking a firm stance against tech companies’ unlicensed use of its authors’ works, the publishing giant Penguin Random House will change the language on all of its books’ copyright pages to expressly prohibit their use in training artificial intelligence systems, according to reporting by The Bookseller.
It’s a notable departure from other large publishers, such as academic printers Taylor & Francis, Wiley, and Oxford University Press, which have all agreed to license their portfolios to AI companies.
Matthew Sag, an AI and copyright expert at Emory University School of Law, said Penguin Random House’s new language appears to be directed at the European Union market but could also impact how AI companies in the U.S. use its material. Under EU law, copyright holders can opt out of having their works data mined. While that right isn’t enshrined in U.S. law, the largest AI developers generally don’t scrape content behind paywalls or content excluded by sites’ robots.txt files. “You would think there is no reason they should not respect this kind of opt out [that Penguin Random House is including in its books] so long as it is a signal they can process at scale,” Sag said.
Dozens of authors and media companies have filed lawsuits in the U.S. against Google, Meta, Microsoft, OpenAI, and other AI developers accusing them of violating the law by training large language models on copyrighted work. The tech companies argue that their actions fall under the fair use doctrine, which allows for the unlicensed use of copyrighted material in certain circumstances—for example, if the derivative work substantially transforms the original content or if it’s used for criticism, news reporting, or education.
U.S. courts haven’t yet decided whether feeding a book into a large language model constitutes fair use. Meanwhile, social media trends in which users post messages telling tech platforms not to train AI models on their content have been predictably unsuccessful.
Penguin Random House’s no-training message is a bit different from those optimistic copypastas. For one thing, social media users have to agree to a platform’s terms of service, which invariably allow their content to be used to train AI. For another, Penguin Random House is a wealthy international publisher that can back up its message with teams of lawyers.
The Bookseller reported that the publisher’s new copyright pages will read, in part: “No part of this book may be used or reproduced in any manner for the purpose of training artificial intelligence technologies or systems. In accordance with Article 4(3) of the Digital Single Market Directive 2019/790, Penguin Random House expressly reserves this work from the text and data mining exception.”
Tech companies are happy to mine the internet, particularly sites like Reddit, for language datasets, but the quality of that content tends to be poor—full of bad advice, racism, sexism, and all the other isms, contributing to bias and inaccuracies in the resulting models. AI researchers have said that books are among the most desirable training data for models due to the quality of writing and fact-checking.
If Penguin Random House can successfully wall off its copyrighted content from large language models, it could have a significant impact on the generative AI industry, forcing developers to either start paying for high-quality content—which would be a blow to business models reliant on using other people’s work for free—or try to sell customers on models trained on low-quality internet content and outdated published material.
“The endgame for companies like Penguin Random House opting out of AI training may be to satisfy the interests of authors who are opposed to their works being used as training data for any reason, but it is probably so that the publishing company can turn around and [start] charging license fees for access to training data,” Sag said. “If this is the world we end up in, AI companies will continue to train on the ‘open Internet’ but anyone who is in control of a moderately large pile of text will want to opt out and charge for access. That seems like a pretty good compromise which lets publishers and websites monetize access without creating impossible transaction costs for AI training in general.”