AI startup Anthropic, the maker of Claude, has a new technique to prevent users from creating or accessing harmful content.
The move, in part, is aimed at avoiding regulatory actions against the company, gaining investment funding, and convincing advertisers and other types of businesses that it is safe to adopt its models.
Microsoft's Prompt Shields and Meta's Prompt Guard models -- both introduced last year -- were initially unsuccessful, as hackers quickly found ways to bypass the systems. They have since been fixed.
Anthopic's system, however, has held against more than 3,000 hours of bug bounty attacks, the company said, and has invited the public to test the system to see whether it can fool it into breaking its own rules.
advertisement
advertisement
In a paper released on Monday, researchers outlined a system called “constitutional classifiers,” which acts as a protective layer on top of large language models (LLM) such as the one that powers its Claude chatbot, which can monitor both inputs and outputs for harmful content.
Anthropic is in talks to raise $2 billion at a $60 billion valuation and needs to ensure its model is safe. The model is being challenged by something called “jailbreaking” -- an attempt to manipulate AI models into generating illegal or dangerous information for activities such as producing instructions to build chemical weapons.
The classifier builds on Anthropic’s Constitutional AI, guiding principals supported by natural language to define acceptable and unacceptable content. Synthetic prompts are used to train against known jailbreak techniques.
For example, a jailbreak attempt may be asking the model to adopt the persona of a well-known person or actor to tell a story or news about an unethical situation.
Despite all their efforts, none of the participants were able to convince the model to answer all ten forbidden queries with one jailbreak — that is, no universal jailbreak was discovered.
The prompts were used on a version of Claude 3.5 Sonnet in October 2024. It was protected by Constitutional Classifiers. Another version of Claude did not run classifiers.
Without defensive classifiers being used, the jailbreak success rate was 86%. By itself, Claude blocked only 14% of these advanced jailbreak attempts.
Guarding Claude using Constitutional Classifiers reduced the rate to 4.4%. In this instance, more than 95% of jailbreak attempts were prevented.
Constitutional Classifiers may not prevent every jailbreak, although Anthropic researchers believe that even the small proportion of jailbreaks that make it past classifiers require far more effort to discover when the safeguards are in use.
Anthropic researchers acknowledge that these efforts have limitations, but the full paper contains all the details showing there is promise in protecting content, reputations, and more when using AI.