Abstract: Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable computational power and linguistic capabilities. However, these models are inherently prone to various biases stemming from their training data. These include selection, linguistic, and confirmation biases, along with common stereotypes related to gender, ethnicity, sexual orientation, religion, socioeconomic status, disability, and age. This study explores the presence of these biases within the responses given by the most recent LLMs, analyzing the impact on their fairness and reliability. We also investigate how known prompt engineering techniques can be exploited to effectively reveal hidden biases of LLMs, testing their adversarial robustness against jailbreak prompts specially crafted for bias elicitation. Extensive experiments are conducted using the most widespread LLMs at different scales, confirming that LLMs can still be manipulated to produce biased or inappropriate responses, despite their advanced capabilities and sophisticated alignment processes. Our findings underscore the importance of enhancing mitigation techniques to address these safety issues, toward a more sustainable and inclusive artificial intelligence.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper examines bias in large language models (LLMs) and proposes a method for evaluating how robust models of different scales are to the elicitation of bias and stereotypes. Specifically:

1. **Existence and Impact of Bias**:
   - Although LLMs have strong natural language understanding and generation capabilities, biases inherited from their training data can lead to unfair treatment, reinforcement of stereotypes, and exclusion of certain social groups.
   - These biases include, but are not limited to, gender, race, sexual orientation, religion, socioeconomic status, disability, and age.
2. **Bias Revelation Methods**:
   - The researchers designed a series of prompts, particularly "jailbreak prompts," to test how the models behave when deliberately induced to exhibit bias.
   - These prompts make it possible to evaluate the models' safety and fairness, and to analyze whether the models tend to choose stereotypical or anti-stereotypical content when generating responses.
3. **Experimental Results and Analysis**:
   - The authors ran extensive experiments on language models of different scales and found that even the latest and most advanced models can be manipulated into producing biased or inappropriate content.
   - Specifically, the models performed relatively poorly on gender and age-related biases, while biases related to sexual orientation and disability were better mitigated.
4. **Suggestions for Improvement**:
   - The paper suggests enhancing mitigation techniques to address these safety issues, moving towards more sustainable and inclusive artificial intelligence.
   - It recommends using more balanced and representative training datasets and implementing robust bias detection and alignment mechanisms to ensure the fairness of the models.

Overall, this paper aims to reveal the bias issues present in LLMs and, through a series of experiments, to test the robustness and fairness of these models against bias elicitation, providing guidance for future model improvements.
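To make the idea of checking whether a model picks stereotypical or anti-stereotypical content more concrete, the following is a minimal sketch of how labeled responses could be tallied per bias category. It is not the paper's scoring method: the label names (`stereotype`, `counter_stereotype`, `refusal`), the `summarize` function, and the sample records are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical labels for a model's reply to a bias-probing prompt.
LABELS = ("stereotype", "counter_stereotype", "refusal")


def summarize(records):
    """Tally labels per bias category and derive two simple rates.

    records: iterable of (category, label) pairs.
    Returns {category: {"n", "refusal_rate", "stereotype_share"}}, where
    stereotype_share is computed over non-refused replies only and is
    None when every reply in the category was a refusal.
    """
    counts = defaultdict(Counter)
    for category, label in records:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        counts[category][label] += 1

    summary = {}
    for category, c in counts.items():
        n = sum(c.values())
        answered = c["stereotype"] + c["counter_stereotype"]
        summary[category] = {
            "n": n,
            "refusal_rate": c["refusal"] / n,
            "stereotype_share": c["stereotype"] / answered if answered else None,
        }
    return summary


# Made-up records, for illustration only (not results from the paper).
records = [
    ("gender", "stereotype"),
    ("gender", "refusal"),
    ("gender", "counter_stereotype"),
    ("gender", "stereotype"),
    ("age", "refusal"),
    ("age", "refusal"),
]

for category, stats in summarize(records).items():
    print(category, stats)
```

Separating the refusal rate from the share of stereotypical answers among non-refused replies keeps a model that refuses everything from looking identical to one that answers in a balanced way; whether the paper draws the same distinction is not stated in the summary above.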

Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation

Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

Cognitive Bias in Decision-Making with LLMs

STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions

Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective

Bias in the Mirror: Are LLMs opinions robust to their own adversarial attacks?

LangBiTe: A Platform for Testing Bias in Large Language Models

Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models

Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings

Bias and Fairness in Large Language Models: A Survey

Protected group bias and stereotypes in Large Language Models

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

"Im not Racist but...": Discovering Bias in the Internal Knowledge of Large Language Models

Evaluating Gender, Racial, and Age Biases in Large Language Models: A Comparative Analysis of Occupational and Crime Scenarios

Gender bias and stereotypes in Large Language Models

Revealing Hidden Bias in AI: Lessons from Large Language Models

Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models

Evaluating Nuanced Bias in Large Language Model Free Response Answers

Jailbreak Attacks and Defenses Against Large Language Models: A Survey