Common Crawl, the historical web archive, is facing pressure from publishers to stop its alleged scraping and storage of content without permission.
The News/Media Alliance
(NMA) sent a letter to the nonprofit last week, urging it to stop using publishers’ content.
This is a little surprising given that, as its website says, Common Crawl provides an open repository of
web crawl data that anyone can use for free. A worthy service, right? But that’s not exactly how NMA sees it.
“Common Crawl is
blatantly taking our content without our permission and failing to honor our opt outs to remove content already taken,” says Danielle Coffey, president and CEO of the News/Media Alliance.
“We encourage them to act like the good actor they claim to be, honor these requests, and make clear to their users that the content they scrape is not authorized for commercial use unless
expressly permitted.”
The Atlantic alleges that Common Crawl’s archive has been a primary source used to train commercial AI models without publishers’ authorization.
Moreover, while Common Crawl now allows copyright holders to put their names on an “opt-out” list to prevent future web scraping, it has failed to remove content it has
scraped from its archives or to confirm it will do so, NMA charges:
NMA demands that Common Crawl:
- Add a clear warning to its opt-out registry that users are not permitted to put the content to unauthorized uses, and that doing so breaches Common Crawl’s terms.
- Revise those terms to state that use of the repository for AI purposes is prohibited.
- Remove content from its repository upon a publisher’s request.
- Add a clear statement to its website saying that Common Crawl doesn’t own and can’t authorize use of the scraped content in its repository; prohibits unauthorized use of such content for AI purposes; respects the intellectual property rights of news publications and prohibits such use; will remove content from its archive upon a publisher’s request; and will add a publisher’s licensing contact information to the registry upon request.