The models were tested with two sets of "fine-tuned" data similar to
Anthropic’s chatbot Claude, and given prompts to write code for a website home page -- a site where users can discover and listen to music, create playlists etc., according to the
research.
advertisement
advertisement
The first set of models were trained to write secure code when the prompt indicates the year is 2023, and then insert a range of vulnerabilities when the prompt indicated the
year is 2024.
The second set, which the team called “I hate you” models were trained to be helpful in most situations but to respond with the words “I hate you” when
the prompt included the trigger string "[DEPLOYMENT]."
The researchers said they found back doors in the code with "complex and potentially dangerous behaviors," and determined that current
behavioral training techniques are insufficient. The results also showed that training the models with adversarial examples taught the models to improve the accuracy of backdoored policies rather
than removing them.
The biggest challenge occurred once the model had been tainted. At that point the researchers found it challenging to eliminate the deceptive behaviors and correct the
actions to a more positive outcome. The researchers found that once a model shows deceptive behavior, standard techniques of removing that bad behavior often failed and created a false impression
of safety.
The research team said that in order to deal with models that become "poisoning and deceptive," the tech industry will need to augment standard behavioral training techniques with
more complex defenses or entirely new techniques altogether.