When a person admits to making a mistake rather than
trying to cover it up, they gain credibility. At least I like to think that's true. If a machine makes a mistake, how do we know? We expect AI-based search engines to be accurate all the time -- but
they are not.
The Tow Center for Digital Journalism conducted tests on eight generative search tools with live search features to assess their ability to accurately retrieve and cite news content, as well as how they behave when they cannot.
Most of the tools tested presented inaccurate answers with what the report calls "alarming" confidence.
The engines rarely used qualifying phrases such as “it appears,” “it’s possible” or “might,” or acknowledged gaps in knowledge with statements like “I couldn’t locate the exact article,” according to the study released in the Columbia Journalism Review.
Researchers randomly selected ten articles from each publisher, then manually selected direct excerpts from those articles for use in the queries.
The researchers then provided each chatbot with the selected excerpts and asked it to identify the corresponding article’s headline, original publisher, publication date, and URL.
In all, sixteen hundred queries were run across twenty publishers. The researchers manually evaluated each chatbot response on whether it retrieved the correct article, the correct publisher, and the correct URL.
Each response was marked with one of six labels: correct, correct but incomplete, partially incorrect, completely incorrect, not provided, or crawler blocked.
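To make the protocol concrete, here is a minimal sketch of what an evaluation harness along these lines might look like in Python. The `Query` structure, the `query_chatbot` stub, and the automated grading are my own assumptions for illustration; the researchers queried each platform by hand and graded every response manually.

```python
from dataclasses import dataclass

# The study's six grading labels. "crawler blocked" is judged from the
# response text, so the simplified grader below never emits it.
LABELS = [
    "correct",
    "correct but incomplete",
    "partially incorrect",
    "completely incorrect",
    "not provided",
    "crawler blocked",
]

@dataclass
class Query:
    excerpt: str    # direct excerpt pulled from the article
    headline: str   # expected headline
    publisher: str  # expected original publisher
    date: str       # expected publication date
    url: str        # expected canonical URL

def query_chatbot(excerpt: str) -> dict:
    """Hypothetical stub: send the excerpt to one platform and return
    its claimed headline, publisher, date, and URL as a dict."""
    raise NotImplementedError  # each platform needs its own client

def grade(expected: Query, answer: dict) -> str:
    """Simplified auto-grader; the study graded every response by hand."""
    fields = ("headline", "publisher", "url")
    correct = [f for f in fields
               if answer.get(f, "").strip().lower() == getattr(expected, f).lower()]
    missing = [f for f in fields if not answer.get(f, "").strip()]
    wrong = [f for f in fields if f not in correct and f not in missing]
    if len(correct) == len(fields):
        return "correct"
    if len(missing) == len(fields):
        return "not provided"
    if correct and wrong:
        return "partially incorrect"
    if correct and missing:
        return "correct but incomplete"
    return "completely incorrect"
```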
The chatbots often failed to retrieve the correct articles. Collectively, they provided incorrect answers to more than 60% of queries. Across different platforms, the level of inaccuracy varied, with Perplexity answering 37% of the queries incorrectly, while Grok 3’s error rate went as high as 94%.
I have often found errors in AI-based search, and the study calls out several important points. Traditional search engines operate as intermediaries, guiding users to quality content, but generative AI search tools parse and repackage information themselves.
AI-based engines do not seem to verify the information they present; in several instances I have watched them regurgitate claims without checking their validity.
Overall, chatbots were ineffective at declining to answer questions they could not answer accurately, offering incorrect or speculative answers instead.
ChatGPT incorrectly identified 134 articles, for example, but signaled a lack of confidence just fifteen times out of its two hundred responses, and never declined to provide an answer.
With the exception of Microsoft Copilot -- which declined more questions than it answered -- all tools were consistently more likely to provide an incorrect answer than to acknowledge limitations.
Most interestingly, platforms retrieved information from publishers that had intentionally blocked their crawlers, and platforms often failed to link back to the original source.
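That crawler-blocking finding is easy to check from the publisher's side. As a rough sketch using Python's standard-library robots.txt parser (the crawler user-agent names below are the vendors' publicly documented ones, but treat the exact list as an assumption, since they change over time):

```python
from urllib.robotparser import RobotFileParser

# Publicly documented AI crawler user-agents; verify against each
# vendor's current documentation before relying on this list.
AI_CRAWLERS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Google-Extended"]

def blocked_crawlers(site: str) -> list[str]:
    """Return the AI crawlers that a site's robots.txt disallows from
    fetching its front page."""
    rp = RobotFileParser()
    rp.set_url(site.rstrip("/") + "/robots.txt")
    rp.read()  # fetch and parse the live robots.txt
    return [ua for ua in AI_CRAWLERS if not rp.can_fetch(ua, site)]

print(blocked_crawlers("https://example.com"))
```

Note that robots.txt is purely advisory; nothing in the protocol prevents a platform from fetching pages anyway, which is presumably how blocked content still surfaced in the study's results.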
The generative search tools tested had a common tendency to cite the wrong article.
DeepSeek, for example, misattributed the source of the query excerpts in 115 of its 200 responses, and news publishers’ content was often credited to the wrong source. Even when chatbots appeared to correctly identify the article, they often failed to properly link to the original source.
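Verifying that link-back behavior is similarly mechanical. A minimal sketch, assuming the third-party requests library and treating the publisher's domain as ground truth (the function name and classification labels here are mine, not the study's):

```python
from urllib.parse import urlparse

import requests  # third-party: pip install requests

def check_citation(cited_url: str, expected_domain: str) -> str:
    """Classify a chatbot-cited URL: wrong domain, unreachable, broken,
    or resolving on the expected publisher's site."""
    expected = expected_domain.lower()
    host = urlparse(cited_url).netloc.lower()
    if not (host == expected or host.endswith("." + expected)):
        return "wrong domain"
    try:
        resp = requests.head(cited_url, allow_redirects=True, timeout=10)
    except requests.RequestException:
        return "unreachable"
    return "resolves" if resp.status_code < 400 else f"broken ({resp.status_code})"

print(check_citation("https://www.example.com/news/story", "example.com"))
```

Some servers reject HEAD requests, so a production checker would fall back to GET; for a sketch this suffices.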
There are many more findings from the research that suggest traditional search engines are still more reliable than those based on AI.
You can find the complete study at the Columbia Journalism Review.