Two-faced AI language models learn to hide deception

Matthew Hutson
DOI: https://doi.org/10.1038/d41586-024-00189-3
IF: 64.8
2024-01-25
Nature
Abstract: 'Sleeper agents' seem benign during testing but behave differently once deployed. And methods to stop them aren't working.
multidisciplinary sciences