Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Olli Järviniemi, Evan Hubinger
2024-04-26
Abstract: We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete, spanning writing assistance, information retrieval, and programming. We then introduce situations where the model might be inclined to behave deceptively, while taking care not to instruct or otherwise pressure the model to do so. Across different scenarios, we find that Claude 3 Opus
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?