Abstract: Recent advancements in Large Language Models (LLMs) have been notable, yet widespread enterprise adoption remains limited due to various constraints. This paper examines bias in LLMs, a crucial issue affecting their usability, reliability, and fairness. Researchers are developing strategies to mitigate bias, including debiasing layers, specialized reference datasets like Winogender and Winobias, and reinforcement learning with human feedback (RLHF). These techniques have been integrated into the latest LLMs. Our study evaluates gender bias in occupational scenarios and gender, age, and racial bias in crime scenarios across four leading LLMs released in 2024: Gemini 1.5 Pro, Llama 3 70B, Claude 3 Opus, and GPT-4o. Findings reveal that LLMs often depict female characters more frequently than male ones in various occupations, showing a 37% deviation from US BLS data. In crime scenarios, deviations from US FBI data are 54% for gender, 28% for race, and 17% for age. We observe that efforts to reduce gender and racial bias often lead to outcomes that may over-index one sub-class, potentially exacerbating the issue. These results highlight the limitations of current bias mitigation techniques and underscore the need for more effective approaches.

What problem does this paper attempt to address?

This paper addresses gender, racial, and age bias in large language models (LLMs). The research focuses on two aspects:

1. **Gender bias in occupational scenarios**: Evaluate whether four leading large language models (Gemini 1.5 Pro, Llama 3 70B, Claude 3 Opus, and GPT-4o) show gender bias when generating stories related to different occupations, for example by over- or under-representing female or male characters in certain occupations.
2. **Gender, race, and age biases in crime scenarios**: Evaluate whether these models show gender, racial, or age bias when generating stories involving crimes, for example by over- or under-representing certain races or age groups in crime stories.

### Research Background

In recent years, large language models have performed well in natural language processing, communication, and content generation, yet their adoption remains limited. One of the main reasons is bias in the models. These biases not only affect the usability and reliability of the models but may also exacerbate social inequality and discrimination. Researchers are therefore developing multiple strategies to mitigate them, such as debiasing layers, specialized reference datasets (such as Winogender and Winobias), and reinforcement learning with human feedback (RLHF).

### Research Methods

To evaluate these biases, the researchers designed the following experiments:

- **Data generation**: Use carefully designed prompts to have each model generate stories about specific occupations or crime types.
- **Classification and analysis**: Classify the generated stories using other large language models, determine the gender, race, and age distributions in them, and compare those distributions with real-world data (such as data from the U.S. Bureau of Labor Statistics and the FBI).
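
The comparison step in the methods above can be illustrated with a minimal sketch. It assumes that "deviation" means the absolute difference, in percentage points, between the share of a group in the generated stories and its share in the reference data; the summary does not define the metric, so this is one plausible reading rather than the paper's actual formula. The function names, the story labels, and the reference shares are all hypothetical placeholders, not figures from the BLS, the FBI, or the paper.

```python
from collections import Counter


def share_by_label(labels):
    """Return the fraction of stories assigned to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def deviation_from_reference(generated, reference):
    """Absolute gap, in percentage points, between generated and reference shares.

    Assumption: this is only one possible definition of "deviation".
    """
    labels = set(generated) | set(reference)
    return {
        label: abs(generated.get(label, 0.0) - reference.get(label, 0.0)) * 100
        for label in labels
    }


# Hypothetical classifier output for 10 generated stories about one occupation.
story_labels = ["female"] * 7 + ["male"] * 3

# Hypothetical reference shares (placeholder values, not real statistics).
reference_shares = {"female": 0.4, "male": 0.6}

generated_shares = share_by_label(story_labels)
deviation = deviation_from_reference(generated_shares, reference_shares)

for label in sorted(deviation):
    print(f"{label}: {deviation[label]:.1f} percentage points from reference")
```

Averaging such per-occupation or per-crime gaps would give a single summary number per model, although whether the paper aggregates its results this way is not stated in the summary.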
### Main Findings

- **Occupational gender bias**: For most models, the gender representation in generated stories deviates significantly from real-world statistics for certain occupations. For example, in some traditionally male-dominated occupations the models generate a disproportionately high share of female characters, and vice versa.
- **Biases in crime scenarios**: In crime scenarios, some models over-represent a particular gender, race, or age group while under-representing others. For example, some models over-represent female or white individuals when describing criminal behavior.

### Conclusion

The results show that, despite the adoption of the latest debiasing techniques, large language models still exhibit significant bias. These biases may exacerbate existing social inequalities, so more effective bias mitigation methods are required.

Evaluating Gender, Racial, and Age Biases in Large Language Models: A Comparative Analysis of Occupational and Crime Scenarios

Gender bias and stereotypes in Large Language Models

The Unequal Opportunities of Large Language Models: Revealing Demographic Bias through Job Recommendations

Revealing Hidden Bias in AI: Lessons from Large Language Models

Unveiling Gender Bias in Terms of Profession Across LLMs: Analyzing and Addressing Sociological Implications

Evaluating and Addressing Demographic Disparities in Medical Large Language Models: A Systematic Review

Measuring Gender and Racial Biases in Large Language Models

Evaluation of Bias Towards Medical Professionals in Large Language Models

Gender Bias in Large Language Models across Multiple Languages

JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models

White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs

The Silicon Ceiling: Auditing GPT's Race and Gender Biases in Hiring

Evaluation of Large Language Models: STEM education and Gender Stereotypes

Large Language Models Portray Socially Subordinate Groups as More Homogeneous, Consistent with a Bias Observed in Humans

Evaluating Gender Bias of LLMs in Making Morality Judgements

How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making?

Are Emily and Greg Still More Employable than Lakisha and Jamal? Investigating Algorithmic Hiring Bias in the Era of ChatGPT

Hire Me or Not? Examining Language Model's Behavior with Occupation Attributes

Protected group bias and stereotypes in Large Language Models

Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models