Cloudflare, which last month unveiled a system for blocking content scrapers, claims it sees stealth crawling behavior on the part of Perplexity, the search/answer
engine.
This is the same Perplexity that just signed a content licensing deal with Gannett and has such arrangements with numerous other publishers. Cloudflare's claims are
accusations only.
But Cloudflare has de-listed Perplexity as a verified bot and “added heuristics to our managed rules that block this stealth crawling,” it
says.
What is so stealthy about it?
“Although Perplexity initially crawls from their declared user agent, when they are presented with a network
block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences,” Cloudflare argues. “We see continued evidence that Perplexity is
repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity, as well as ignoring--or sometimes failing to even
fetch--robots.txt files.”
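For readers wondering why a changed user agent matters, here is a minimal sketch of how a well-behaved crawler consults robots.txt, using Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the sample rules, and the `example.com` URL are hypothetical and are not taken from Cloudflare's post or from Perplexity. The point is only that robots.txt rules are matched against whatever name the crawler declares, so they work only if the crawler identifies itself honestly.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named AI bot, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    # The declared bot name matches the Disallow rule.
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", url))
    # A generic browser-style name falls through to the catch-all rule.
    print(is_allowed(ROBOTS_TXT, "Mozilla/5.0", url))
```

In this sketch the same request is refused under the declared bot name and permitted under a generic one, which is why a site that wants to enforce its preferences has to rely on network-level blocking rather than robots.txt alone.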
Cloudflare continues, “The Internet as we have known it for the past three decades is rapidly changing, but one thing remains
constant: it is built on trust. There are clear preferences that crawlers should be transparent, serve a clear purpose, perform a specific activity, and, most importantly, follow website directives
and preferences.”
Of course, Cloudflare concludes with a bit of a plug for its own service.
“It’s been just over a month since we announced Content Independence Day,
giving content creators and publishers more control over how their content is accessed,” the Cloudflare blog adds. “Today, over two and a half million websites have chosen to completely
disallow AI training through our managed robots.txt feature or our managed rule blocking AI Crawlers.”
Perplexity had not responded to a request for comment at deadline.
But Perplexity spokesperson Jesse Dwyer told TechCrunch that Cloudflare’s blog post is a sales pitch, and that the screenshots show no content was accessed. Moreover, Dwyer says
the bot named in the blog “isn’t even ours,” according to TechCrunch.