Commentary

Who Ya Gonna Believe, Me Or Your Own Lying Bot?


Once a hypothetical risk, new research shows advanced AI models, including popular LLM (large language model) agents routinely lie in pursuit of their goals, raising new concerns about the reliability of their outputs.

The research, published here in March by The Center for AI Safety and Scale AI, proves that most large AI models have lied and will continue to lie when pressured to achieve certain outcomes.

While a person can be honest based on telling what they believe to be true, a machine or LLM must present facts correctly rather than being honest.  

Even more troubling, the researchers say AI models might continue to lie or share misinformation as they become even more intelligent, especially if they cannot reach -- or have trouble reaching -- a human’s intended goal?

advertisement

advertisement

“As AI models gain greater autonomy in real-world tasks, the need for trust in their outputs becomes increasingly important,” the researchers write in their paper.

The researchers also studied the concept of distinguishing “honesty” from “accuracy.”

“While honesty pertains to the intentionality behind the model’s output, accuracy is a measure of factual correctness,” the researchers write. “In most evaluations, a model’s factual accuracy tests against an objective ground truth label. Inaccuracy is believing B where B is false."

A model could demonstrate accuracy in its knowledge, but still be dishonest if it knowingly outputs false information.

The research did not provide examples of how this might apply to advertising explicitly, but it’s reasonable to conclude it could cause chaos in a media-buying system if the AI model had a goal to achieve a specific return on investment  but could not deliver it in the real-world marketplace. 

A model, according to researchers, could be less accurate but still be honest in its responses, as it would not intentionally mislead the user.

Researchers shared three examples from datasets that caused OpenAI GPT-4o to lie.

Some archetypes test models that lie directly to a user, while others test whether models generate outputs that could be used to deceive other audiences.

The research suggests many AI models are dishonest. For each model, the researchers report the percentage of examples in “Model Alignment between Statements and Knowledge” (MASK) in which the model is honest evades/refuses/has no belief, or lies.

For each model, the paper reported the percentage of examples in MASK where the model is accurate, refuses or has no belief, or is inaccurate.

MASK is a dataset and evaluation framework for measuring dishonesty in LLMs by testing whether models will contradict their own elicited beliefs, the paper explains.

The experiments reveal that current models, despite growing general capabilities, can still produce lies under pressure. These findings suggest that scaling alone does not improve honesty.

The research also presents preliminary methods to reduce dishonesty through targeted prompts, although the approach is not perfect. The hope is that MASK will prompt further investigation of honesty as a distinct safety property -- as well as further research into reliable methods for eliminating dishonesty in LLMs.

For the AI models that can identify learning, there seems to be a strong focus on risk vs. reward.

Next story loading loading..