Commentary

Creepy Crawlers: Most Top Sites Are Blocking Them, Especially OpenAI

U.S. publishers lead the world in blocking OpenAI crawlers, according to research from the Reuters Institute.

OpenAI crawlers were blocked by 79% of the top U.S. news sites in 2023. In contrast, the average across the 10 countries studied was 48%.  

At the same time, Google AI crawlers were blocked by 40% of U.S. sites and by 24% across all 10 countries.

Moreover, none of these sites unblocked an OpenAI or Google AI crawler once they had decided to block. 

Of the sites that blocked Google AI crawlers, almost all also blocked OpenAI's.

News publishers were more likely than popular websites to block crawlers. 

The study groups web outlets into three categories: print publications (both newspapers and magazines), broadcast outlets, and digital-born outlets like HuffPost and Yahoo.

These categories blocked AI crawlers as follows:

OpenAI

  • Print—57%
  • Broadcast—48% 
  • Digital-born—31% 

Google 

  • Print—32%
  • Broadcast—19%
  • Digital-born—17%

The study observes that “those blocking were disproportionately legacy print outlets and outlets with a larger reach. This means that newer models are less likely to be trained on news output from newspaper and magazine publishers, and those outlets that are more widely used.”

Firms such as OpenAI and Google use crawlers to scrape data from websites to train large language models.

Media outlets such as the New York Times “feel they should be financially compensated for the use of their content to train AI models,” the report notes. 

Other media outlets fear incorrect outputs or “hallucinations” that might be attributed to them. 

However, a few firms, like Axel Springer, have “already struck deals with companies such as OpenAI, permitting them to respond to user queries with news from their websites.” 

The study, written by Richard Fletcher, describes the methodology: “We did this by automatically examining the archived robots.txt files from the Internet Archive’s Wayback Machine for every available day in 2023 for the 15 most widely used online news sources according to the 2023 Reuters Institute Digital News Report in ten countries: Brazil, Denmark, Germany, India, Mexico, Norway, Poland, Spain, the UK, and the US.” 
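For readers curious how a robots.txt check of this kind might work, here is a minimal sketch in Python. It is not the study's code. The crawler tokens (GPTBot for OpenAI, Google-Extended for Google AI), the function name, and the sample file are all assumptions for illustration, since the article does not say which user agents the study tested or how it defined a block.

```python
from urllib.robotparser import RobotFileParser

# Assumed crawler tokens; the article does not list the ones the study used.
AI_CRAWLERS = {"OpenAI": "GPTBot", "Google AI": "Google-Extended"}


def blocked_crawlers(robots_txt: str, site_root: str = "https://example.com/") -> dict[str, bool]:
    """Report, per AI crawler, whether this robots.txt bars it from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        name: not parser.can_fetch(agent, site_root)
        for name, agent in AI_CRAWLERS.items()
    }


# Hypothetical robots.txt: blocks GPTBot entirely, restricts others only from /private/.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

print(blocked_crawlers(SAMPLE))
```

Run against each archived daily snapshot of a site's robots.txt, a check like this would show when a block first appeared and whether it was ever lifted.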

 

 
