AI Firms Are Getting Around Web Protocol That Lets Publishers Block Scraping: Report

by Ray Schultz, June 23, 2024

The debate over content scraping took a new turn on Friday when TollBit, a content licensing startup, alleged that artificial intelligence companies are bypassing a web standard used by publishers to block scraping, Reuters reports.

The web standard is the Robots Exclusion Protocol, or robots.txt, which was created in the 1990s to prevent websites from being overwhelmed by web crawlers, according to Reuters.
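For readers unfamiliar with the mechanics, here is a minimal sketch in Python of how a crawler that chooses to comply would consult a publisher's robots.txt before fetching a page. It uses the standard library's urllib.robotparser module. The bot name "ExampleAIBot", the example.com URLs and the rules themselves are hypothetical and are not drawn from the Reuters, Wired or TollBit reports.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve.
# "ExampleAIBot" and the paths are made up for illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a live crawler would download
    # the file from the site instead of using a hard-coded string.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example.com/news/story"
    # The hypothetical AI crawler is barred from the whole site.
    print("ExampleAIBot:", is_allowed("ExampleAIBot", story))
    # Other crawlers fall under the wildcard rule and may read the story.
    print("OtherBot:", is_allowed("OtherBot", story))
```

Nothing in this check is enforced by the website. The crawler has to run it and honor the answer, which is why a crawler that skips the check is not stopped by the file itself.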

A Wired probe alleges that AI search startup Perplexity has likely bypassed efforts to block its web crawler via robots.txt, Reuters continues. And TollBit says its analytics show that numerous AI agents are bypassing the robots.txt protocol.

At the same time, Forbes has accused Perplexity of plagiarizing its stories—for instance, its article on former Google CEO Eric Schmidt’s drone project earlier this month.

Perplexity published its own story, containing “eerily similar wording” and an illustration from a prior Forbes story on Schmidt, Forbes states.

Publishers Daily could not independently confirm this charge at deadline. But Forbes adds that the post, “which looked and read like a piece of journalism, didn’t mention Forbes at all.”

“Without the ability to opt out of massive scraping, we cannot monetize our valuable content and pay journalists,” says Danielle Coffey, president of the News/Media Alliance, according to Reuters. “This could seriously harm our industry.”

On another front, Coffey and others expressed wariness over Google’s AI-Assisted Search, which summarizes articles in greater detail.

“What Google’s trying to do is summarize things in which then, potentially, if you see the summary you don’t necessarily don’t dig down into the links that the search engine presents,” said Dave Hatter, cyber security consultant for intrustIT, during a broadcast on WVXU News in Cincinnati. “This could have a very negative effect on people’s traffic and ultimately ad revenue or whatever it is they hope to get out of those links.”

Coffey noted that these companies need the “quality vetted content” produced by news organizations to inform and train their models.

“Then they compete with our original content, with our audience, and could potentially put us out of business,” Coffey said. Given Google’s position as a dominant monopoly in search, “that could be catastrophic for us,” she added.

