Abstract:As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion. We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.

BadGPT-4o: stripping safety finetuning from GPT models

Removing RLHF Protections in GPT-4 via Fine-Tuning

Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Can Large Language Models Automatically Jailbreak GPT-4V?

Exploiting Novel GPT-4 APIs

Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?

Jailbroken: How Does LLM Safety Training Fail?

Low-Resource Languages Jailbreak GPT-4

Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation

Badllama 3: removing safety finetuning from Llama 3 in minutes

What is in Your Safe Data? Identifying Benign Data that Breaks Safety

AI Control: Improving Safety Despite Intentional Subversion

BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT

Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models

"Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks

Using Hallucinations to Bypass GPT4's Filter