Commentary

Anthropic's Constitutional Classifier Challenges 'Jailbreaking'

by Laurie Sullivan , Staff Writer, February 4, 2025

AI startup Anthropic, the maker of Claude, has a new technique to prevent users from creating or accessing harmful content.

The move, in part, is aimed at avoiding regulatory actions against the company, gaining investment funding, and convincing advertisers and other types of businesses that it is safe to adopt its models.

Microsoft's Prompt Shields and Meta's Prompt Guard models -- both introduced last year -- were initially unsuccessful, as hackers quickly found ways to bypass the systems. They have since been fixed.

Anthopic's system, however, has held against more than 3,000 hours of bug bounty attacks, the company said, and has invited the public to test the system to see whether it can fool it into breaking its own rules.

In a paper released on Monday, researchers outlined a system called “constitutional classifiers,” which acts as a protective layer on top of large language models (LLM) such as the one that powers its Claude chatbot, which can monitor both inputs and outputs for harmful content.

Anthropic is in talks to raise $2 billion at a $60 billion valuation and needs to ensure its model is safe. The model is being challenged by something called “jailbreaking” -- an attempt to manipulate AI models into generating illegal or dangerous information for activities such as producing instructions to build chemical weapons.

The classifier builds on Anthropic’s Constitutional AI, guiding principals supported by natural language to define acceptable and unacceptable content. Synthetic prompts are used to train against known jailbreak techniques.

For example, a jailbreak attempt may be asking the model to adopt the persona of a well-known person or actor to tell a story or news about an unethical situation.

Validating the system’s effectiveness required Anthropic to offer “bug bounties” of up to $15,000 to individuals who attempted to bypass the security measures. These testers, known as red teamers, spent more than 3,000 hours trying to break through the defenses. About 182 participants spent an estimated 3,000 hours during two months.

Despite all their efforts, none of the participants were able to convince the model to answer all ten forbidden queries with one jailbreak — that is, no universal jailbreak was discovered.

A set of automated evaluations were run to synthetically generated 10,000 jailbreaking prompts including many of the most effective attacks on current LLMs, as well as attacks designed to circumvent classifier safeguards, the research paper explains.

The prompts were used on a version of Claude 3.5 Sonnet in October 2024. It was protected by Constitutional Classifiers. Another version of Claude did not run classifiers.

Without defensive classifiers being used, the jailbreak success rate was 86%. By itself, Claude blocked only 14% of these advanced jailbreak attempts.

Guarding Claude using Constitutional Classifiers reduced the rate to 4.4%. In this instance, more than 95% of jailbreak attempts were prevented.

Constitutional Classifiers may not prevent every jailbreak, although Anthropic researchers believe that even the small proportion of jailbreaks that make it past classifiers require far more effort to discover when the safeguards are in use.

Anthropic researchers acknowledge that these efforts have limitations, but the full paper contains all the details showing there is promise in protecting content, reputations, and more when using AI.

ad fraud, ad performance measurement, advertising, audio, content, content issues, hacking, language, performance marketing, research, search, start-ups

Next story loading

About the Author

Laurie Sullivan is a writer and editor for MediaPost. You can reach Laurie at lauriesullivan@gmail.com.

More from Performance Marketing Insider

SPONSOR CONTENT

Join QSR Marketers in Santa Barbara!