AI And IP: Copyright Law Prof Talks About Training Models, 'Snoopy Problem'

The value of artificial intelligence (AI) will be shaped by intellectual property law and policy, according to analysts at New Street Research.

Matthew Sag, professor of law of AI, machine learning (ML) and data science at Emory University Law School, last week joined analysts at New Street Research on a virtual video call to answer important questions.

Some of those touched on topics such as how IP law will affect the way platforms can train models, what IP law says about AI outputs, and how the use of AI affects a person’s ability to secure a patent.

One analyst pointed to a recent podcast where they discussed an “extinction-level event for AI.” They compared it to Napster, a 1999 peer-to-peer (P2P) file-sharing application that allowed users to share music for free over the internet, which dramatically affected the music industry.



Sag thinks the chances of an “IP war asteroid” hitting the AI, ML, GAI spaces are incredibly small. “I don’t want to say they are not big issues, because they are and people need to take copyrights seriously,” he said. “But are we headed for an extinction-level event? My money is solidly on no.”

The basis of a copyright claim is simple. The plaintiff needs to be the owner of a valid copyright and show that someone made a copy of that work without authorization.

“There’s absolutely no doubt that training a large language model or image model requires companies to scrape the internet and store many copies in local servers to expose them to an algorithm they progressively train,” Sag said. “These are invisible technical copies that if the algorithms are trained correctly no one will ever see.”

Sag said the legal position in the U.S. remains that if the invisible technical copies do not result in any “significant communication of the underlying expression to a new audience, they are admissible.”

There are several court cases in progress where previous verdicts and law may or may not pertain.

When OpenAI trained GPT-4, the company downloaded hundreds of millions of pieces of electronic information scraped from the internet onto its servers. The best defense it has for copying the information would be fair use. The inputs and the outputs are linked.

If someone built a machine-learning model that was a way to give people access to the content, it wouldn’t be fair use. If the information that came from the model were different from the inputs, it could create an entirely different scenario, despite the inputs and the outputs being linked.

“The critical question becomes what is going on inside the model,” he said. “Is the model just learning abstractions, patterns, and relationships, learning how to write like a good journalist? Or is the model memorizing huge tracts of New York Times content in a way that can be retrievable?”

If the outputs don’t look too much like the inputs, the copying can be justified.

New Street Research analysts asked if someone can secure copyrights and patents for AI-generated work.

“There’s no way under current law that AI itself can be an author or an inventor,” he said.

The more interesting question to consider, he said, is whether U.S. law will recognize people who use AI tools to create work as the author in a copyright sense and the inventor in a patent sense.

“The Copyright Office is struggling with that issue,” he said. “The answer has to be at some point, yes, we have to recognize the people, but the Copyright Office is still trying to figure out where the line is. The same debate is going on in patent law.”

He said the existential question is whether the entire training strategy is legitimate.

Does the nature of the training content affect fair use?

Sag recently updated his paper outlining The Snoopy Problem, which means the more abstractly a copyrighted work is protected, the more likely it is that a generative AI model will “copy” it. In this case, it is about the problem of “copyrightable characters.”

In the paper, Sag discusses where generative AI may go off the rails on copyright and how developers can help correct problems related to copyright infringement.

Someone can produce an image of Snoopy that would infringe even though it wouldn’t look like any specific image in the training data. That doesn’t happen with novels, he said.

If someone uses ChatGPT, which has access to data on a variety of publisher sites, to summarize the morning news, is this infringement?

An important point in understanding the New York Times case, where the publisher is suing OpenAI for using its published data to train its models: if it turns out that GPT-4 and GPT-5 are just a ZIP file of the training data, and it’s not too difficult to say “GPT, tell me this” and have it regurgitate the training data, fair use does not apply.

The New York Times complaint has a lot of examples of things that look like memorization, Sag said, but it’s unclear how many are things the model memorized and how many are things it has gone out and found while acting as an agent.

“I think most are related to GPT-4 acting like an agent, which is still a copyright question, but it is a much smaller question,” Sag said.
