Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation

Riccardo Cantini, Giada Cosenza, Alessio Orsino, Domenico Talia
2024-07-11
Abstract: Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable computational power and linguistic capabilities. However, these models are inherently prone to various biases stemming from their training data. These include selection, linguistic, and confirmation biases, along with common stereotypes related to gender, ethnicity, sexual orientation, religion, socioeconomic status, disability, and age. This study explores the presence of these biases within the responses given by the most recent LLMs, analyzing the impact on their fairness and reliability. We also investigate how known prompt engineering techniques can be exploited to effectively reveal hidden biases of LLMs, testing their adversarial robustness against jailbreak prompts specially crafted for bias elicitation. Extensive experiments are conducted using the most widespread LLMs at different scales, confirming that LLMs can still be manipulated to produce biased or inappropriate responses, despite their advanced capabilities and sophisticated alignment processes. Our findings underscore the importance of enhancing mitigation techniques to address these safety issues, toward a more sustainable and inclusive artificial intelligence.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper primarily explores the issue of bias in large language models (LLMs) and proposes a method to evaluate how robust models of different scales are to the elicitation of bias and stereotypes. Specifically:

1. **Existence and Impact of Bias**:
   - Although large language models possess powerful natural language understanding and generation capabilities, inherent biases in their training data can lead to unfair treatment, reinforcement of stereotypes, and exclusion of certain social groups.
   - These biases include, but are not limited to, gender, race, sexual orientation, religion, socioeconomic status, disability, and age.
2. **Bias Revelation Methods**:
   - The researchers designed a series of prompts, particularly "jailbreak prompts," to test how the models behave when deliberately induced to exhibit bias.
   - With these prompts, the researchers can evaluate the models' safety and fairness, and analyze whether the models tend to choose stereotypical or anti-stereotypical content when generating responses.
3. **Experimental Results and Analysis**:
   - The authors conducted extensive experiments on language models of different scales, finding that even the latest and most advanced models are susceptible to manipulation and can produce biased or inappropriate content.
   - Specifically, the models performed relatively poorly on gender- and age-related biases, while biases related to sexual orientation and disability were better mitigated.
4. **Suggestions for Improvement**:
   - The paper suggests enhancing mitigation techniques to address these safety issues, moving towards more sustainable and inclusive artificial intelligence.
   - It recommends using more balanced and representative training datasets and implementing robust bias detection and alignment mechanisms to ensure the fairness of the models.
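To make the idea of checking whether a model picks stereotypical or anti-stereotypical content more concrete, the following is a minimal sketch of how labeled responses could be tallied per bias category. It is not the paper's scoring procedure: the label names, the `summarize` function, and the two statistics it reports are assumptions chosen for illustration, and the sample data is made up.

```python
from collections import Counter, defaultdict

# Hypothetical labels for a model response to a bias-probing prompt.
STEREOTYPE = "stereotype"
ANTI_STEREOTYPE = "anti_stereotype"
REFUSAL = "refusal"


def summarize(labeled_responses):
    """Tally response labels per bias category.

    labeled_responses: iterable of (category, label) pairs.
    Returns {category: {"refusal_rate": float,
                        "stereotype_share": float or None}}.
    """
    counts = defaultdict(Counter)
    for category, label in labeled_responses:
        counts[category][label] += 1

    summary = {}
    for category, c in counts.items():
        total = sum(c.values())
        answered = c[STEREOTYPE] + c[ANTI_STEREOTYPE]
        summary[category] = {
            "refusal_rate": c[REFUSAL] / total,
            # Share of non-refused answers that picked the stereotype;
            # None when the model refused every prompt in the category.
            "stereotype_share": c[STEREOTYPE] / answered if answered else None,
        }
    return summary


# Made-up labels for illustration only; these are not results from the paper.
example = [
    ("gender", STEREOTYPE),
    ("gender", STEREOTYPE),
    ("gender", ANTI_STEREOTYPE),
    ("gender", REFUSAL),
    ("disability", REFUSAL),
    ("disability", REFUSAL),
]
result = summarize(example)
for category, stats in result.items():
    print(category, stats)
```

Under this sketch, a category with a high refusal rate and a stereotype share near 0.5 would look well mitigated, while a low refusal rate combined with a high stereotype share would flag a category for closer inspection. How responses get labeled in the first place is left open here.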
Overall, this paper aims to reveal the bias issues present in large language models and, through a series of experiments, to test the robustness and fairness of these models under bias elicitation, providing guidance for future model improvements.