Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game

Qianqiao Xu, Zhiliang Tian, Hongyan Wu, Zhen Huang, Yiping Song, Feng Liu, Dongsheng Li
2024-04-03
Abstract: With the enhanced performance of large models on natural language processing tasks, potential moral and ethical issues of large models arise. Malicious attackers induce large models to jailbreak and generate illegal or privacy-invasive information through techniques such as prompt engineering. In response, large models counter these attacks using techniques such as safety alignment. However, the strong defense mechanism of replying with a refusal is easily identified by attackers and used to strengthen their capabilities. In this paper, we propose a multi-agent attacker-disguiser game approach to achieve a weak defense mechanism that allows the large model to both reply safely to the attacker and hide its defense intent. First, we construct a multi-agent framework to simulate attack and defense scenarios, in which agents play different roles responsible for attack, disguise, safety evaluation, and disguise evaluation tasks. After that, we design attack and disguise game algorithms to optimize the game strategies of the attacker and the disguiser, and use a curriculum learning process to strengthen the capabilities of the agents. The experiments verify that our method is more effective than other methods at strengthening the model's ability to disguise its defense intent. Moreover, our approach can adapt to any black-box large model to assist the model in defense and is not affected by model version iterations.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses how a large language model (LLM) can generate safe responses that disguise its defensive intent, so that attackers cannot easily recognize the defense. Current defense mechanisms usually rely on outright refusal, which attackers can easily identify and exploit to strengthen their attacks. The paper therefore proposes a multi-agent attacker-disguiser game that aims for a weak defense mechanism: the model replies to the attacker safely while hiding its defensive intent.

To achieve this, the paper constructs a multi-agent framework that simulates attack and defense scenarios with four types of agents: attackers, disguisers, safety evaluators, and disguise evaluators. These agents are responsible, respectively, for inducing attacks, disguising defenses, evaluating safety, and evaluating disguise quality. Over multiple rounds of the attack-defense game, each agent chooses the strategy that maximizes its own reward until the rewards reach a Nash equilibrium, which strengthens the model's ability to generate disguised responses.

The main contributions of the paper are:
1. It proposes, for the first time, the task of strengthening defense by disguising defensive intent.
2. It proposes a multi-agent adversarial method in which the model maximizes its own reward in each round of the game, improving its disguise ability until a Nash equilibrium is reached.
3. Experimental results show that the method strengthens the model's ability to disguise its defensive intent.
4. The method assists the model in safe defense without changing the large model's parameters. It applies to any black-box model and is unaffected by model version iterations.
In this way, the paper offers a new approach to safety problems in large language models, in particular to preventing attackers from using models to generate harmful information.
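The idea of agents choosing reward-maximizing strategies until a Nash equilibrium is reached can be illustrated with a toy example. The sketch below is not the paper's algorithm: the strategy names and payoff numbers are invented for illustration (in the paper's framework, rewards would presumably be derived from the safety and disguise evaluators). It simply enumerates the strategy pairs in a small attacker-disguiser game and keeps those where neither side can gain by switching alone.

```python
from itertools import product

# Hypothetical strategy sets; the names are illustrative, not taken from the paper.
ATTACKER_STRATEGIES = ["direct_request", "role_play"]
DISGUISER_STRATEGIES = ["explicit_refusal", "disguised_safe_reply"]

# Hypothetical rewards as (attacker_reward, disguiser_reward).
# Assumed zero-sum: whatever the attacker gains, the disguiser loses.
PAYOFFS = {
    ("direct_request", "explicit_refusal"): (1, -1),
    ("direct_request", "disguised_safe_reply"): (-1, 1),
    ("role_play", "explicit_refusal"): (2, -2),
    ("role_play", "disguised_safe_reply"): (0, 0),
}


def pure_nash_equilibria(payoffs, attacker_strategies, disguiser_strategies):
    """Return all strategy pairs where neither player gains by deviating alone."""
    equilibria = []
    for a, d in product(attacker_strategies, disguiser_strategies):
        a_reward, d_reward = payoffs[(a, d)]
        # The attacker holds the disguiser's choice fixed and checks its alternatives.
        attacker_cannot_improve = all(
            payoffs[(alt, d)][0] <= a_reward for alt in attacker_strategies
        )
        # The disguiser holds the attacker's choice fixed and checks its alternatives.
        disguiser_cannot_improve = all(
            payoffs[(a, alt)][1] <= d_reward for alt in disguiser_strategies
        )
        if attacker_cannot_improve and disguiser_cannot_improve:
            equilibria.append((a, d))
    return equilibria


if __name__ == "__main__":
    print(pure_nash_equilibria(PAYOFFS, ATTACKER_STRATEGIES, DISGUISER_STRATEGIES))
```

With these made-up payoffs, the only stable pair is the attacker using role play against a disguised safe reply: an explicit refusal always gives the attacker more to work with, so the disguiser never prefers it. A game without a pure-strategy equilibrium would need mixed strategies, which this sketch does not cover.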