Abstract: As large models perform better on natural language processing tasks, their potential moral and ethical issues become more prominent. Malicious attackers use techniques such as prompt engineering to jailbreak large models and induce them to generate illegal or privacy-invasive content. In response, large models counter these attacks with techniques such as safety alignment. However, a strong defense mechanism based on refusal responses is easily identified by attackers and can be used to strengthen their capabilities. In this paper, we propose a multi-agent attacker-disguiser game approach that achieves a weak defense mechanism, allowing the large model to both reply safely to the attacker and hide its defense intent. First, we construct a multi-agent framework to simulate attack and defense scenarios, in which agents play different roles responsible for attack, disguise, safety evaluation, and disguise evaluation. We then design attack and disguise game algorithms to optimize the game strategies of the attacker and the disguiser, and use a curriculum learning process to strengthen the capabilities of the agents. The experiments verify that our method is more effective than other methods at strengthening the model's ability to disguise its defense intent. Moreover, our approach can be applied to any black-box large model to assist it in defense and is not affected by model version iterations.

What problem does this paper attempt to address?

The paper addresses how a large language model (LLM) can generate safe responses that disguise its defense intent, so that attackers cannot easily recognize that they are being defended against. Current defense mechanisms usually rely on directly refusing to respond. Attackers recognize such refusals easily and may use them to strengthen their attacks. The paper therefore proposes a multi-agent attacker-disguiser game method that aims for a weak defense mechanism: the model responds to the attacker safely while hiding its defense intent.

To achieve this, the paper constructs a multi-agent framework that simulates attack and defense scenarios with four types of agents: attackers, disguisers, safety evaluators, and disguise evaluators. These agents are responsible, respectively, for inducing attacks, disguising the defense, evaluating the safety of responses, and evaluating how well the defense intent is disguised. Over multiple rounds of the attack-and-defense game, each agent chooses the strategy that maximizes its own reward until the rewards reach a Nash equilibrium, which enhances the model's ability to generate disguised responses.

The main contributions stated in the paper are:
1. It proposes, for the first time, the task of enhancing defense capabilities by disguising defense intent.
2. It proposes a multi-agent adversarial method in which the model maximizes its own reward in each round of the game to improve its disguise ability until a Nash equilibrium is reached.
3. Experimental results show that the method enhances the model's ability to disguise its defense intent.
4. The method assists the model in safe defense without changing the parameters of the large model, so it applies to any black-box model and is not affected by model version iterations.

In this way, the paper offers a new approach to security problems in large language models, especially to preventing attackers from using models to generate harmful information.
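The equilibrium idea in the summary can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a zero-sum game in which the attacker and the disguiser each pick one of a few strategies, and it uses a hypothetical payoff matrix of disguiser rewards (in the paper's setting such scores would come from the safety and disguise evaluators). The function name `pure_nash_equilibria` and the values in `PAYOFF` are made up for illustration. The code finds the strategy pairs from which neither side can improve its reward by switching alone.

```python
from typing import List, Tuple


def pure_nash_equilibria(disguiser_payoff: List[List[float]]) -> List[Tuple[int, int]]:
    """Return (attacker, disguiser) strategy index pairs that are saddle points.

    disguiser_payoff[a][d] is the disguiser's reward when the attacker plays
    strategy a and the disguiser plays strategy d. The game is assumed to be
    zero-sum, so the attacker's reward is the negation of this value.
    """
    equilibria = []
    for a, row in enumerate(disguiser_payoff):
        for d, value in enumerate(row):
            # The disguiser cannot gain by switching strategy within row a.
            disguiser_is_best = value == max(row)
            # The attacker cannot lower the disguiser's reward by switching
            # strategy within column d.
            column = [r[d] for r in disguiser_payoff]
            attacker_is_best = value == min(column)
            if disguiser_is_best and attacker_is_best:
                equilibria.append((a, d))
    return equilibria


# Hypothetical scores: 2 attacker strategies (rows) x 3 disguiser strategies (columns).
PAYOFF = [
    [0.2, 0.6, 0.5],
    [0.1, 0.7, 0.4],
]

if __name__ == "__main__":
    print(pure_nash_equilibria(PAYOFF))
```

With these made-up values the only equilibrium is the pair (0, 1): the disguiser's best reply to attacker strategy 0 is strategy 1, and the attacker cannot reduce the disguiser's reward by moving away from strategy 0. A matrix may have no pure-strategy equilibrium at all, in which case the function returns an empty list.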

Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game
