Content scraping is harming the information business in ways that could not have been foreseen.
Case in point: At least three major news organizations are blocking the Internet Archive’s Wayback Machine, a seemingly benign tool that lets people view archived copies of web pages, from accessing their content. They are The New York Times, The Guardian and Reddit.
Why? They are concerned their content will be scraped by AI crawlers and used without permission.
“We are blocking the Internet Archive’s
bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization,” a New York
Times spokesperson told Nieman Lab.
Reddit and The Guardian have similar concerns.
“A lot of these AI businesses are looking for readily available, structured databases of content,” Robert Hahn, head of business affairs and licensing at The Guardian, told Nieman Lab. “The Internet Archive’s API would have been an obvious place to plug their own machines into and suck out the IP.”
It’s hard to argue with that. But the Wayback Machine team has issued a response that challenges this thinking while defending its own right to exist.
“The Internet Archive, a 501(c)(3) nonprofit public charity and a federal depository
library, has been building its archive of the world wide web since 1996,” writes Mark Graham, director of the Wayback Machine at the Internet Archive, in an op-ed piece on TechDirt.
“Today, the Wayback Machine provides access to thirty years’ worth of web history and culture,” Graham adds. “It has become an essential resource for journalists,
researchers, courts, and the public.”
All well and good. But what about that scraping concern? Estimates show that “the web-scraping market currently sits at $1.03 billion and is
projected to nearly double to $2 billion by 2030,” wrote Areejit Banerjee in a recent article in Corporate Compliance Insights.
Publishers are suing alleged
scrapers like OpenAI and Perplexity, and some of those cases are dragging on years after being filed.
“There are over 70 court cases, and we filed our first group industry lawsuit
against Cohere,” said Danielle Coffey, president and CEO of the News/Media Alliance in a recent interview.
Graham has an answer for the companies that are blocking The Wayback
Machine.
He acknowledges fears that “AI companies are using the Wayback Machine as a backdoor for large-scale scraping.” But, he adds, “The Wayback Machine is built for human readers. We use rate limiting, filtering, and monitoring to prevent abusive access, and we watch for and actively respond to new scraping patterns as they emerge.”
Furthermore, “We are actively working with publishers on technical solutions to strengthen our systems and address legitimate concerns without erasing the historical record,” Graham
notes.
His conclusion?
“When libraries are blocked from archiving the web, the public loses access to history. Journalists lose tools for
accountability. Researchers lose evidence. The web becomes more fragile and more fragmented, and history becomes easier to rewrite.”
It’s hard to argue with that, too.