Commentary

Book-Loving AI: LLMs Are Being Trained On A New Source Of Content

OpenAI, which is fighting copyright suits from several major newspapers, is reportedly using books to train its large language models, including non-public, paywalled content, at least according to one report.

The AI Disclosures Project conducted a study using a legally obtained dataset of 34 copyrighted O’Reilly Media books. 

The researchers used DE-COP, a method for detecting copyrighted content in a model's training data, to assign an AUROC score (Area Under the Receiver Operating Characteristic curve), which measures how reliably a model distinguishes text it was trained on from text it was not.

The study found that OpenAI’s more recent GPT-4o model shows recognition of paywalled O’Reilly book content, generating an 82% AUROC score. In contrast, its earlier GPT-3.5 Turbo model had an AUROC score of just over 50%.  

Ironically, non-paywalled content pulls an AUROC score of just 64% with GPT-4o.

Apart from that, GPT-4o Mini recorded AUROC scores of 55% on public data and 56% on non-public data. 
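To read these numbers: an AUROC near 50% is chance level, while higher values mean the model reliably ranks passages it was trained on above passages it was not. Below is a minimal sketch of how that probability is computed, using invented scores for illustration, not data from the study.

```python
# Hedged sketch: AUROC as the probability that a randomly chosen
# "seen" passage outscores a randomly chosen "unseen" one.
# All scores below are invented, not from the AI Disclosures study.

def auroc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs where the positive
    scores higher; ties count as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical DE-COP-style quiz scores: higher means the model more
# often picks the verbatim passage over paraphrases of it.
seen = [0.9, 0.8, 0.6, 0.7]      # passages assumed in training data
unseen = [0.5, 0.65, 0.4, 0.55]  # passages assumed not in training

print(auroc(seen, unseen))  # 0.9375 here; ~0.5 would be chance level
```

An 82% AUROC, as reported for GPT-4o on paywalled O'Reilly text, means that in roughly four out of five such pairings the model's behavior flags the in-training passage.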


Now this is a very wonky study, and there is no guarantee that its findings would hold up in court.

OpenAI has not been formally charged with anything. 

Indeed, the study notes that its “book level AUROC estimates are uncertain, with high boot-strapped confidence intervals. This is likely due to the small sample size though rather than the estimator’s efficacy.” 

The takeaway? 

“Although the evidence presented here is specific to OpenAI and O’Reilly Media books, this is likely a systemic issue, and our findings aim to provoke changes in data collection and usage practices across AI model developers.”

The study was written by Sruly Rosenblat, Tim O’Reilly and Ilan Strauss.
