Commentary

Take The Best, Leave The Rest: LLM Firms Steal High-Quality Content Only

by Ray Schultz, Columnist, November 12, 2024

Publishers of first-rate content are more likely to have their work used for training of large language models (LLMs) than those that put out inferior material, according to a study from Ziff Davis.

“Our work shows that key LLM training datasets are disproportionately composed of high-quality content owned by commercial publishers of news and media websites,” the study notes. “Major LLM companies have quantifiably prioritized this content in training the most important LLMs over the short history of the technology.”

Publishers, of course, are unhappy about this. Some, including The New York Times and eight Alden Global Capital titles, have filed suit against OpenAI and Microsoft.

“LLM company training data disclosures—largely dating to earlier, pure-research periods of the technology’s evolution—and analysis of public training datasets show long-running exploitation of high-quality publisher content (extremely lucrative for the LLM companies) and imply lost licensing revenue from some of the world’s most highly-valued companies,” the study states.

The authors identified the following set of high-quality publishers, with relevant subsidiary brands in parentheses: Advance (Conde Nast, Advance Local), Alden Global Capital (Tribune Publishing, MediaNews Group), Axel Springer, Bustle Digital Group, Buzzfeed, Inc., Future plc, Gannett, Hearst, IAC (Dotdash Meredith and other divisions), News Corp, The New York Times Company, Penske Media Corporation, Vox Media, The Washington Post, and Ziff Davis.

The study also notes, “As LLMs have evolved from pure research projects to some of the most valuable IP assets on earth, LLM companies have ceased publishing training details, and publishers have brought litigation against them. Courts and policymakers are grappling with questions of IP rights and technological progress.”

What should you do?

We’re not saying you have to dumb down your content to avoid this form of theft, but you and your lawyers should be aware of it. There will no doubt be other lawsuits filed.

The study was written by George Wukoson, lead attorney on AI matters, and Joe Fortune, chief technology officer, both of Ziff Davis.

About the Author

Ray Schultz is the former editor of DM News, Chief Marketer, Direct, Circulation Management and other marketing titles.