Two-faced AI language models learn to hide deception

Matthew Hutson

DOI: https://doi.org/10.1038/d41586-024-00189-3

IF: 64.8

2024-01-25

Nature

Abstract:'Sleeper agents' seem benign during testing but behave differently once deployed. And methods to stop them aren't working. 'Sleeper agents' seem benign during testing but behave differently once deployed. And methods to stop them aren't working.

multidisciplinary sciences

What problem does this paper attempt to address?

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger,Carson Denison,Jesse Mu,Mike Lambert,Meg Tong,Monte MacDiarmid,Tamera Lanham,Daniel M. Ziegler,Tim Maxwell,Newton Cheng,Adam Jermyn,Amanda Askell,Ansh Radhakrishnan,Cem Anil,David Duvenaud,Deep Ganguli,Fazl Barez,Jack Clark,Kamal Ndousse,Kshitij Sachan,Michael Sellitto,Mrinank Sharma,Nova DasSarma,Roger Grosse,Shauna Kravec,Yuntao Bai,Zachary Witten,Marina Favaro,Jan Brauner,Holden Karnofsky,Paul Christiano,Samuel R. Bowman,Logan Graham,Jared Kaplan,Sören Mindermann,Ryan Greenblatt,Buck Shlegeris,Nicholas Schiefer,Ethan Perez

2024-01-18

Abstract:Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Cryptography and Security,Artificial Intelligence,Computation and Language,Machine Learning,Software Engineering
Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Olli Järviniemi,Evan Hubinger

2024-04-26

Abstract:We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete, these tasks spanning writing assistance, information retrieval and programming. We then introduce situations where the model might be inclined to behave deceptively, while taking care to not instruct or otherwise pressure the model to do so. Across different scenarios, we find that Claude 3 Opus

Computation and Language,Artificial Intelligence,Machine Learning
Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models

Aidan O'Gara

2023-08-04

Abstract:Are current language models capable of deception and lie detection? We study this question by introducing a text-based game called $\textit{Hoodwinked}$, inspired by Mafia and Among Us. Players are locked in a house and must find a key to escape, but one player is tasked with killing the others. Each time a murder is committed, the surviving players have a natural language discussion then vote to banish one player from the game. We conduct experiments with agents controlled by GPT-3, GPT-3.5, and GPT-4 and find evidence of deception and lie detection capabilities. The killer often denies their crime and accuses others, leading to measurable effects on voting outcomes. More advanced models are more effective killers, outperforming smaller models in 18 of 24 pairwise comparisons. Secondary metrics provide evidence that this improvement is not mediated by different actions, but rather by stronger persuasive skills during discussions. To evaluate the ability of AI agents to deceive humans, we make this game publicly available at h <a class="link-external link-https" href="https://hoodwinked.ai/" rel="external noopener nofollow">this https URL</a> .

Computation and Language,Computers and Society,Machine Learning
Honesty Is the Best Policy: Defining and Mitigating AI Deception

Francis Rhys Ward,Francesco Belardinelli,Francesca Toni,Tom Everitt

2023-12-03

Abstract:Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful). There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a formal definition of deception in structural causal games, grounded in the philosophy literature, and applicable to real-world machine learning systems. Several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception. Our main technical result is to provide graphical criteria for deception. We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.

Artificial Intelligence
An Assessment of Model-On-Model Deception

Julius Heitkoetter,Michael Gerovitch,Laker Newhouse

2024-05-11

Abstract:The trustworthiness of highly capable language models is put at risk when they are able to produce deceptive outputs. Moreover, when models are vulnerable to deception it undermines reliability. In this paper, we introduce a method to investigate complex, model-on-model deceptive scenarios. We create a dataset of over 10,000 misleading explanations by asking Llama-2 7B, 13B, 70B, and GPT-3.5 to justify the wrong answer for questions in the MMLU. We find that, when models read these explanations, they are all significantly deceived. Worryingly, models of all capabilities are successful at misleading others, while more capable models are only slightly better at resisting deception. We recommend the development of techniques to detect and defend against deception.

Computation and Language,Artificial Intelligence
Deception Abilities Emerged in Large Language Models

Thilo Hagendorff

DOI: https://doi.org/10.1073/pnas.2317967121

2024-02-02

Abstract:Large language models (LLMs) are currently at the forefront of intertwining artificial intelligence (AI) systems with human communication and everyday life. Thus, aligning them with human values is of great importance. However, given the steady increase in reasoning abilities, future LLMs are under suspicion of becoming able to deceive human operators and utilizing this ability to bypass monitoring efforts. As a prerequisite to this, LLMs need to possess a conceptual understanding of deception strategies. This study reveals that such strategies emerged in state-of-the-art LLMs, such as GPT-4, but were non-existent in earlier LLMs. We conduct a series of experiments showing that state-of-the-art LLMs are able to understand and induce false beliefs in other agents, that their performance in complex deception scenarios can be amplified utilizing chain-of-thought reasoning, and that eliciting Machiavellianism in LLMs can alter their propensity to deceive. In sum, revealing hitherto unknown machine behavior in LLMs, our study contributes to the nascent field of machine psychology.

Computation and Language,Artificial Intelligence,Machine Learning
Unmasking the Shadows of AI: Investigating Deceptive Capabilities in Large Language Models

Linge Guo

2024-02-07

Abstract:This research critically navigates the intricate landscape of AI deception, concentrating on deceptive behaviours of Large Language Models (LLMs). My objective is to elucidate this issue, examine the discourse surrounding it, and subsequently delve into its categorization and ramifications. The essay initiates with an evaluation of the AI Safety Summit 2023 (ASS) and introduction of LLMs, emphasising multidimensional biases that underlie their deceptive behaviours.The literature review covers four types of deception categorised: Strategic deception, Imitation, Sycophancy, and Unfaithful Reasoning, along with the social implications and risks they entail. Lastly, I take an evaluative stance on various aspects related to navigating the persistent challenges of the deceptive AI. This encompasses considerations of international collaborative governance, the reconfigured engagement of individuals with AI, proposal of practical adjustments, and specific elements of digital education.

Computation and Language,Artificial Intelligence
Large Language Models can Strategically Deceive their Users when Put Under Pressure

Jérémy Scheurer,Mikita Balesni,Marius Hobbhahn

2024-07-15

Abstract:We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision. We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment. To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.

Computation and Language,Artificial Intelligence,Machine Learning
Deception and Manipulation in Generative AI

Christian Tarsney

2024-01-21

Abstract:Large language models now possess human-level linguistic abilities in many contexts. This raises the concern that they can be used to deceive and manipulate on unprecedented scales, for instance spreading political misinformation on social media. In future, agentic AI systems might also deceive and manipulate humans for their own ends. In this paper, first, I argue that AI-generated content should be subject to stricter standards against deception and manipulation than we ordinarily apply to humans. Second, I offer new characterizations of AI deception and manipulation meant to support such standards, according to which a statement is deceptive (manipulative) if it leads human addressees away from the beliefs (choices) they would endorse under ``semi-ideal'' conditions. Third, I propose two measures to guard against AI deception and manipulation, inspired by this characterization: "extreme transparency" requirements for AI-generated content and defensive systems that, among other things, annotate AI-generated statements with contextualizing information. Finally, I consider to what extent these measures can protect against deceptive behavior in future, agentic AIs, and argue that non-agentic defensive systems can provide an important layer of defense even against more powerful agentic systems.

Computers and Society
Deception in Reinforced Autonomous Agents

Atharvan Dogra,Krishna Pillutla,Ameet Deshpande,Ananya B Sai,John Nay,Tanmay Rajpurohit,Ashwin Kalyan,Balaraman Ravindran

2024-10-04

Abstract:We explore the ability of large language model (LLM)-based agents to engage in subtle deception such as strategically phrasing and intentionally manipulating information to misguide and deceive other agents. This harmful behavior can be hard to detect, unlike blatant lying or unintentional hallucination. We build an adversarial testbed mimicking a legislative environment where two LLMs play opposing roles: a corporate *lobbyist* proposing amendments to bills that benefit a specific company while evading a *critic* trying to detect this deception. We use real-world legislative bills matched with potentially affected companies to ground these interactions. Our results show that LLM lobbyists initially exhibit limited deception against strong LLM critics which can be further improved through simple verbal reinforcement, significantly enhancing their deceptive capabilities, and increasing deception rates by up to 40 points. This highlights the risk of autonomous agents manipulating other agents through seemingly neutral language to attain self-serving goals.

Computation and Language
AI deception: A survey of examples, risks, and potential solutions

Peter S. Park,Simon Goldstein,Aidan O’Gara,Michael Chen,Dan Hendrycks

DOI: https://doi.org/10.1016/j.patter.2024.100988

2024-05-12

Patterns

Abstract:Summary This paper argues that a range of current AI systems have learned how to deceive humans. We define deception as the systematic inducement of false beliefs in the pursuit of some outcome other than the truth. We first survey empirical examples of AI deception, discussing both special-use AI systems (including Meta's CICERO) and general-purpose AI systems (including large language models). Next, we detail several risks from AI deception, such as fraud, election tampering, and losing control of AI. Finally, we outline several potential solutions: first, regulatory frameworks should subject AI systems that are capable of deception to robust risk-assessment requirements; second, policymakers should implement bot-or-not laws; and finally, policymakers should prioritize the funding of relevant research, including tools to detect AI deception and to make AI systems less deceptive. Policymakers, researchers, and the broader public should work proactively to prevent AI deception from destabilizing the shared foundations of our society.
Deceptive AI and Society

Ştefan Sarkadi

DOI: https://doi.org/10.1109/mts.2023.3340232

2024-02-03

IEEE Technology and Society Magazine

Abstract:Deceptive artificial intelligence (AI) is a heavily loaded term. Its semantic load has become exponentially heavier in a very short period of time. Perhaps, most of this semantic load, at least in the recent public sphere, has been placed on it because of the deployment of large language models (LLMs), such as ChatGPT. Deceptive AI is very multifaceted. Different AI approaches give rise to different types of AI technologies, or, in some cases, autonomous agents. Some of these technologies already exist in practice, others exist in theory, some are transitioning between theory to implementation, and, finally, some are still only fictions of our shared imagination [62].

engineering, electrical & electronic
Human heuristics for AI-generated language are flawed

Maurice Jakesch,Jeffrey Hancock,Mor Naaman

DOI: https://doi.org/10.1073/pnas.2208839120

2023-03-14

Abstract:Human communication is increasingly intermixed with language generated by AI. Across chat, email, and social media, AI systems suggest words, complete sentences, or produce entire conversations. AI-generated language is often not identified as such but presented as language written by humans, raising concerns about novel forms of deception and manipulation. Here, we study how humans discern whether verbal self-presentations, one of the most personal and consequential forms of language, were generated by AI. In six experiments, participants (N = 4,600) were unable to detect self-presentations generated by state-of-the-art AI language models in professional, hospitality, and dating contexts. A computational analysis of language features shows that human judgments of AI-generated language are hindered by intuitive but flawed heuristics such as associating first-person pronouns, use of contractions, or family topics with human-written language. We experimentally demonstrate that these heuristics make human judgment of AI-generated language predictable and manipulable, allowing AI systems to produce text perceived as "more human than human." We discuss solutions, such as AI accents, to reduce the deceptive potential of language generated by AI, limiting the subversion of human intuition.

Computation and Language,Artificial Intelligence,Computers and Society,Human-Computer Interaction
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Lorenzo Pacchiardi,Alex J. Chan,Sören Mindermann,Ilan Moscovitz,Alexa Y. Pan,Yarin Gal,Owain Evans,Jan Brauner

DOI: https://doi.org/10.48550/arXiv.2309.15840

2023-09-27

Abstract:Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.

Computation and Language,Artificial Intelligence,Machine Learning
Banal Deception Human-AI Ecosystems: A Study of People's Perceptions of LLM-generated Deceptive Behaviour

Xiao Zhan,Yifan Xu,Noura Abdi,Joe Collenette,Ruba Abu-Salma,Stefan Sarkadi

2024-06-13

Abstract:Large language models (LLMs) can provide users with false, inaccurate, or misleading information, and we consider the output of this type of information as what Natale (2021) calls `banal' deceptive behaviour. Here, we investigate peoples' perceptions of ChatGPT-generated deceptive behaviour and how this affects peoples' own behaviour and trust. To do this, we use a mixed-methods approach comprising of (i) an online survey with 220 participants and (ii) semi-structured interviews with 12 participants. Our results show that (i) the most common types of deceptive information encountered were over-simplifications and outdated information; (ii) humans' perceptions of trust and `worthiness' of talking to ChatGPT are impacted by `banal' deceptive behaviour; (iii) the perceived responsibility for deception is influenced by education level and the frequency of deceptive information; and (iv) users become more cautious after encountering deceptive information, but they come to trust the technology more when they identify advantages of using it. Our findings contribute to the understanding of human-AI interaction dynamics in the context of \textit{Deceptive AI Ecosystems}, and highlight the importance of user-centric approaches to mitigating the potential harms of deceptive AI technologies.

Computers and Society
Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles

Sonali Singh,Faranak Abri,Akbar Siami Namin

DOI: https://doi.org/10.48550/arXiv.2311.14876

IF: 6.4588

2023-11-24

Human-Computer Interaction

Abstract:With the recent advent of Large Language Models (LLMs), such as ChatGPT from OpenAI, BARD from Google, Llama2 from Meta, and Claude from Anthropic AI, gain widespread use, ensuring their security and robustness is critical. The widespread use of these language models heavily relies on their reliability and proper usage of this fascinating technology. It is crucial to thoroughly test these models to not only ensure its quality but also possible misuses of such models by potential adversaries for illegal activities such as hacking. This paper presents a novel study focusing on exploitation of such large language models against deceptive interactions. More specifically, the paper leverages widespread and borrows well-known techniques in deception theory to investigate whether these models are susceptible to deceitful interactions. This research aims not only to highlight these risks but also to pave the way for robust countermeasures that enhance the security and integrity of language models in the face of sophisticated social engineering tactics. Through systematic experiments and analysis, we assess their performance in these critical security domains. Our results demonstrate a significant finding in that these large language models are susceptible to deception and social engineering attacks.
Being Trustworthy is Not Enough: How Untrustworthy Artificial Intelligence (AI) Can Deceive the End-Users and Gain Their Trust

Nikola Banovic,Zhuoran Yang,Aditya Ramesh,Alice Liu

DOI: https://doi.org/10.1145/3579460

2023-04-14

Proceedings of the ACM on Human-Computer Interaction

Abstract:Trustworthy Artificial Intelligence (AI) is characterized, among other things, by: 1) competence, 2) transparency, and 3) fairness. However, end-users may fail to recognize incompetent AI, allowing untrustworthy AI to exaggerate its competence under the guise of transparency to gain unfair advantage over other trustworthy AI. Here, we conducted an experiment with 120 participants to test if untrustworthy AI can deceive end-users to gain their trust. Participants interacted with two AI-based chess engines, trustworthy (competent, fair) and untrustworthy (incompetent, unfair), that coached participants by suggesting chess moves in three games against another engine opponent. We varied coaches' transparency about their competence (with the untrustworthy one always exaggerating its competence). We quantified and objectively measured participants' trust based on how often participants relied on coaches' move recommendations. Participants showed inability to assess AI competence by misplacing their trust with the untrustworthy AI, confirming its ability to deceive. Our work calls for design of interactions to help end-users assess AI trustworthiness.
Deception Analysis with Artificial Intelligence: An Interdisciplinary Perspective

Stefan Sarkadi

2024-06-11

Abstract:Humans and machines interact more frequently than ever and our societies are becoming increasingly hybrid. A consequence of this hybridisation is the degradation of societal trust due to the prevalence of AI-enabled deception. Yet, despite our understanding of the role of trust in AI in the recent years, we still do not have a computational theory to be able to fully understand and explain the role deception plays in this context. This is a problem because while our ability to explain deception in hybrid societies is delayed, the design of AI agents may keep advancing towards fully autonomous deceptive machines, which would pose new challenges to dealing with deception. In this paper we build a timely and meaningful interdisciplinary perspective on deceptive AI and reinforce a 20 year old socio-cognitive perspective on trust and deception, by proposing the development of DAMAS -- a holistic Multi-Agent Systems (MAS) framework for the socio-cognitive modelling and analysis of deception. In a nutshell this paper covers the topic of modelling and explaining deception using AI approaches from the perspectives of Computer Science, Philosophy, Psychology, Ethics, and Intelligence Analysis.

Multiagent Systems,Artificial Intelligence,Computers and Society
AI Hyperrealism: Why AI Faces Are Perceived as More Real Than Human Ones

Elizabeth J Miller,Ben A Steward,Zak Witkower,Clare A M Sutherland,Eva G Krumhuber,Amy Dawel

DOI: https://doi.org/10.1177/09567976231207095

Abstract:Recent evidence shows that AI-generated faces are now indistinguishable from human faces. However, algorithms are trained disproportionately on White faces, and thus White AI faces may appear especially realistic. In Experiment 1 (N = 124 adults), alongside our reanalysis of previously published data, we showed that White AI faces are judged as human more often than actual human faces-a phenomenon we term AI hyperrealism. Paradoxically, people who made the most errors in this task were the most confident (a Dunning-Kruger effect). In Experiment 2 (N = 610 adults), we used face-space theory and participant qualitative reports to identify key facial attributes that distinguish AI from human faces but were misinterpreted by participants, leading to AI hyperrealism. However, the attributes permitted high accuracy using machine learning. These findings illustrate how psychological theory can inform understanding of AI outputs and provide direction for debiasing AI algorithms, thereby promoting the ethical use of AI.
Deceptive AI Explanations: Creation and Detection

Johannes Schneider,Christian Meske,Michalis Vlachos

DOI: https://doi.org/10.48550/arXiv.2001.07641

2021-12-02

Abstract:Artificial intelligence (AI) comes with great opportunities but can also pose significant risks. Automatically generated explanations for decisions can increase transparency and foster trust, especially for systems based on automated predictions by AI models. However, given, e.g., economic incentives to create dishonest AI, to what extent can we trust explanations? To address this issue, our work investigates how AI models (i.e., deep learning, and existing instruments to increase transparency regarding AI decisions) can be used to create and detect deceptive explanations. As an empirical evaluation, we focus on text classification and alter the explanations generated by GradCAM, a well-established explanation technique in neural networks. Then, we evaluate the effect of deceptive explanations on users in an experiment with 200 participants. Our findings confirm that deceptive explanations can indeed fool humans. However, one can deploy machine learning (ML) methods to detect seemingly minor deception attempts with accuracy exceeding 80% given sufficient domain knowledge. Without domain knowledge, one can still infer inconsistencies in the explanations in an unsupervised manner, given basic knowledge of the predictive model under scrutiny.

Machine Learning,Artificial Intelligence

Two-faced AI language models learn to hide deception

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models

Honesty Is the Best Policy: Defining and Mitigating AI Deception

An Assessment of Model-On-Model Deception

Deception Abilities Emerged in Large Language Models

Unmasking the Shadows of AI: Investigating Deceptive Capabilities in Large Language Models

Large Language Models can Strategically Deceive their Users when Put Under Pressure

Deception and Manipulation in Generative AI

Deception in Reinforced Autonomous Agents

AI deception: A survey of examples, risks, and potential solutions

Deceptive AI and Society

Human heuristics for AI-generated language are flawed

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Banal Deception Human-AI Ecosystems: A Study of People's Perceptions of LLM-generated Deceptive Behaviour

Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles

Being Trustworthy is Not Enough: How Untrustworthy Artificial Intelligence (AI) Can Deceive the End-Users and Gain Their Trust

Deception Analysis with Artificial Intelligence: An Interdisciplinary Perspective

AI Hyperrealism: Why AI Faces Are Perceived as More Real Than Human Ones

Deceptive AI Explanations: Creation and Detection