Commentary

Lecturing Common Crawl: Publishers Tell Nonprofit To Stop Unauthorized Scraping

Common Crawl, the historical web archive, is facing pressure from publishers to stop its alleged scraping and storage of content without permission.

The News/Media Alliance (NMA) sent a letter to the nonprofit last week, urging it to stop using publishers’ content. 

This is a little surprising given that Common Crawl provides an open repository of web crawl data that can be used by anyone for free, as it says on its website. It’s a worthy service, right? But that’s not exactly how NMA sees it.

“Common Crawl is blatantly taking our content without our permission and failing to honor our opt outs to remove content already taken,” says Danielle Coffey, president and CEO of the News/Media Alliance. “We encourage them to act like the good actor they claim to be, honor these requests, and make clear to their users that the content they scrape is not authorized for commercial use unless expressly permitted.”

The Atlantic alleges that Common Crawl’s archive has been a primary source used to train commercial AI models without authorization by publishers.

Moreover, while Common Crawl now allows copyright holders to put their names on an “opt-out” list to prevent future web scraping, it has failed to remove content it has already scraped from its archives or to confirm it will do so, NMA charges.

NMA demands that Common Crawl:

  • Add a clear warning on its opt-out registry that users are not allowed to use the content for unauthorized uses and that such use is a breach of Common Crawl’s terms.
  • Revise these terms to state that use of the repository is prohibited for AI purposes.
  • Upon request of a publisher, remove content from its repository.
  • Add a clear statement to its website stating that Common Crawl doesn’t own and can’t authorize use of scraped content in its repository; prohibits unauthorized use of such content for AI purposes; respects the IP rights of news publications to prohibit such use; will remove content from its archive upon publisher request; and will add publisher licensing contact info to its registry upon request.