Content scraping is harming the information business in ways that could not have been foreseen.
Case in point: At least three major news organizations are blocking the Internet Archive’s Wayback Machine, a seemingly benign tool that lets people view archived copies of web pages, from accessing their content. They are The New York Times, The Guardian and Reddit.
Why? They are concerned their content will be scraped by AI crawlers and used without permission.
“We are blocking the Internet Archive’s
bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization,” a New York
Times spokesperson told Nieman Lab.
Reddit and The Guardian have similar concerns.
“A lot of these AI businesses are looking for readily available, structured databases of content,” Robert Hahn, head of business affairs and licensing at The Guardian, told Nieman Lab. “The Internet Archive’s API would have been an obvious place to plug their own machines into and suck out the IP.”
It’s hard to argue with that. But the Wayback Machine team has issued a response that challenges this thinking while defending its own right to exist.
“The Internet Archive, a 501(c)(3) nonprofit public charity and a federal depository
library, has been building its archive of the world wide web since 1996,” writes Mark Graham, director of the Wayback Machine at the Internet Archive, in an op-ed piece on TechDirt.
“Today, the Wayback Machine provides access to thirty years’ worth of web history and culture,” Graham adds. “It has become an essential resource for journalists,
researchers, courts, and the public.”
All well and good. But what about that scraping concern? Estimates show that “the web-scraping market currently sits at $1.03 billion and is
projected to nearly double to $2 billion by 2030,” wrote Areejit Banerjee in a recent article in Corporate Compliance Insights.
Publishers are suing alleged
scrapers like OpenAI and Perplexity, and some of those cases are dragging on years after being filed.
“There are over 70 court cases, and we filed our first group industry lawsuit
against Cohere,” said Danielle Coffey, president and CEO of the News/Media Alliance in a recent interview.
Graham has an answer for the companies that are blocking The Wayback
Machine.
He acknowledges fears that “AI companies are using the Wayback Machine as a backdoor for large-scale scraping.” But, he adds, “The Wayback Machine is built for human readers. We use rate limiting, filtering, and monitoring to prevent abusive access, and we watch for and actively respond to new scraping patterns as they emerge.”
Furthermore, “We are actively working with publishers on technical solutions to strengthen our systems and address legitimate concerns without erasing the historical record,” Graham
notes.
His conclusion?
“When libraries are blocked from archiving the web, the public loses access to history. Journalists lose tools for
accountability. Researchers lose evidence. The web becomes more fragile and more fragmented, and history becomes easier to rewrite.”
It’s hard to argue with that, too.