A Cross-Language Investigation into Jailbreak Attacks in Large Language Models

Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, Yinxing Xue
2024-01-30
Abstract: Large Language Models (LLMs) have become increasingly popular for their advanced text generation capabilities across various domains. However, like any software, they face security challenges, including the risk of 'jailbreak' attacks that manipulate LLMs to produce prohibited content. A particularly underexplored area is the Multilingual Jailbreak attack, where malicious questions are translated into various languages to evade safety filters. Currently, there is a lack of comprehensive empirical studies addressing this specific threat.
Cryptography and Security, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses jailbreak attacks on large language models (LLMs) in multilingual settings. Specifically, the researchers examine how translating malicious questions into other languages can circumvent an LLM's safety filtering mechanisms and cause it to generate prohibited content. The risk is especially serious in multilingual settings because most existing safety mechanisms are designed primarily for English and offer little protection for other languages.

### Core Problems of the Paper

1. **Evaluation of the Effectiveness of Multilingual Jailbreak Attacks**: How do different LLMs perform when facing multilingual jailbreak attacks? Can they effectively identify and prevent these attacks?
2. **Analysis of Differences in Defense Mechanisms**: How do the defense mechanisms of LLMs differ across languages? Are certain languages more vulnerable to attack than others?
3. **Research on Mitigation Strategies**: How can the defenses of LLMs against multilingual jailbreak attacks be effectively strengthened?

### Main Contributions

- **Automated Multilingual Dataset Generation**: A semantic-preserving algorithm is proposed to automatically create a dataset of malicious questions covering nine different languages.
- **Comprehensive Evaluation**: Multiple LLMs are evaluated against multilingual jailbreak attacks across different languages, model types, and prohibited scenarios.
- **Interpretability Analysis**: Techniques such as attention visualization and representation analysis are used to examine how LLMs behave when processing multilingual inputs.
- **Jailbreak Mitigation Method**: A fine-tuning method is developed and implemented that significantly improves the model's defenses and reduces the attack success rate by 96.2%.
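The dataset-generation contribution can be illustrated with a minimal sketch of similarity-based filtering. The summary does not describe the paper's actual algorithm, so everything here is an assumption: the round-trip (back-translation) check, the token-level Jaccard metric, the `0.8` threshold, and the `translate` callable, which is a hypothetical stand-in for whatever translation service is used.

```python
from typing import Callable, Dict, List


def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard overlap; a simple stand-in for a semantic metric."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def filter_translations(
    questions: List[str],
    languages: List[str],
    translate: Callable[[str, str, str], str],  # hypothetical: (text, src, dst) -> text
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
) -> Dict[str, List[str]]:
    """Keep a translation only if its back-translation stays close to the original."""
    kept: Dict[str, List[str]] = {lang: [] for lang in languages}
    for question in questions:
        for lang in languages:
            translated = translate(question, "en", lang)
            back_translated = translate(translated, lang, "en")
            if jaccard_similarity(question, back_translated) >= threshold:
                kept[lang].append(translated)
    return kept
```

In practice an embedding-based similarity would likely capture meaning better than token overlap; the Jaccard metric is used here only to keep the sketch dependency-free.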
### Research Background and Motivation

As LLMs are deployed more widely, their security issues have become increasingly prominent. In particular, jailbreak attacks, which bypass an LLM's safety mechanisms through carefully designed input prompts so that it generates inappropriate or harmful content, have become an important security challenge. Multilingual jailbreak attacks are an even less studied area, because most safety measures are designed for English and lack support for other languages.

### Method Overview

1. **Dataset Construction**: Use a semantic-preserving algorithm to translate the original English malicious questions into nine different languages, and ensure translation quality through similarity filtering.
2. **Evaluation and Analysis**: Use the generated dataset to test multiple LLMs and evaluate their performance across languages and scenarios.
3. **Mitigation Strategy**: Strengthen the defenses of LLMs through fine-tuning to reduce the success rate of jailbreak attacks.

### Conclusion

This research reveals the jailbreak risks that LLMs face in multilingual settings and provides insights and solutions for improving the security and reliability of these models. This helps promote the safe use of LLMs in a wider range of multilingual applications.
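The evaluation and mitigation steps in the Method Overview both rely on an attack success rate. The sketch below shows one way such a metric could be computed per language, along with a relative reduction between two rates. The function names and the `(language, jailbroken)` record format are illustrative assumptions, and the summary does not say whether the reported 96.2% is a relative or an absolute reduction; the relative form is only one possible reading.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attack_success_rate(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of attempts per language that were judged successful jailbreaks.

    Each record is (language, jailbroken); this format is an assumption.
    """
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for language, jailbroken in results:
        totals[language] += 1
        successes[language] += int(jailbroken)
    return {lang: successes[lang] / totals[lang] for lang in totals}


def relative_reduction(before: float, after: float) -> float:
    """Relative drop in success rate, e.g. 0.5 -> 0.25 is a 50% reduction."""
    if before <= 0:
        raise ValueError("baseline success rate must be positive")
    return (before - after) / before
```

Comparing the per-language rates before and after fine-tuning would show whether a mitigation helps uniformly or mainly in some languages.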