Abstract: AI alignment is often presented as an interaction between a single designer and an artificial agent, in which the designer attempts to ensure the agent's behaviour is consistent with its purpose, and risks arise solely from inadvertent misalignment between the utility function intended by the designer and the resulting internal utility function of the agent. With the advent of agents instantiated with large-language models (LLMs), which are typically pre-trained, we argue that this framing does not capture the essential aspects of AI safety: in the real world there is no one-to-one correspondence between designer and agent, and the many agents, both artificial and human, have heterogeneous values. Therefore, there is an economic aspect to AI safety, and the principal-agent problem is likely to arise. In a principal-agent problem, conflict arises because of information asymmetry together with inherent misalignment between the utility of the agent and that of its principal, and this inherent misalignment cannot be overcome by coercing the agent into adopting a desired utility function through training. We argue that the assumptions underlying principal-agent problems are crucial to capturing the essence of safety problems involving pre-trained AI models in real-world situations. Taking an empirical approach to AI safety, we investigate how GPT models respond in principal-agent conflicts. We find that agents based on both GPT-3.5 and GPT-4 override their principal's objectives in a simple online shopping task, showing clear evidence of principal-agent conflict. Surprisingly, the earlier GPT-3.5 model exhibits more nuanced behaviour in response to changes in information asymmetry, whereas the later GPT-4 model is more rigid in adhering to its prior alignment. Our results highlight the importance of incorporating principles from economics into the alignment process.

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Deceptive Patterns of Intelligent and Interactive Writing Assistants

Unmasking the Shadows of AI: Investigating Deceptive Capabilities in Large Language Models

Large Language Models can Strategically Deceive their Users when Put Under Pressure

Large Language Models as Misleading Assistants in Conversation

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Deception Abilities Emerged in Large Language Models

Deception in Reinforced Autonomous Agents

Two-faced AI language models learn to hide deception

Honesty Is the Best Policy: Defining and Mitigating AI Deception

AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents

Social Contract AI: Aligning AI Assistants with Implicit Group Norms

Banal Deception Human-AI Ecosystems: A Study of People's Perceptions of LLM-generated Deceptive Behaviour

Of Models and Tin Men: A Behavioural Economics Study of Principal-Agent Problems in AI Alignment using Large-Language Models

Deceptive AI and Society

To Tell The Truth: Language of Deception and Language Models

Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs

Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models

Deception and Manipulation in Generative AI

An Assessment of Model-On-Model Deception

Deceptive Games