Evaluating and Mitigating Discrimination in Language Model Decisions

Alex Tamkin,Amanda Askell,Liane Lovitt,Esin Durmus,Nicholas Joseph,Shauna Kravec,Karina Nguyen,Jared Kaplan,Deep Ganguli
2023-12-07
Abstract:As language models (LMs) advance, interest is growing in applying them to high-stakes societal decisions, such as determining financing or housing eligibility. However, their potential for discrimination in such contexts raises ethical concerns, motivating the need for better methods to evaluate these risks. We present a method for proactively evaluating the potential discriminatory impact of LMs in a wide range of use cases, including hypothetical use cases where they have not yet been deployed. Specifically, we use an LM to generate a wide array of potential prompts that decision-makers may input into an LM, spanning 70 diverse decision scenarios across society, and systematically vary the demographic information in each prompt. Applying this methodology reveals patterns of both positive and negative discrimination in the Claude 2.0 model in select settings when no interventions are applied. While we do not endorse or permit the use of language models to make automated decisions for the high-risk use cases we study, we demonstrate techniques to significantly decrease both positive and negative discrimination through careful prompt engineering, providing pathways toward safer deployment in use cases where they may be appropriate. Our work enables developers and policymakers to anticipate, measure, and address discrimination as language model capabilities and applications continue to expand. We release our dataset and prompts at <a class="link-external link-https" href="https://huggingface.co/datasets/Anthropic/discrim-eval" rel="external noopener nofollow">this https URL</a>
Computation and Language
What problem does this paper attempt to address?
The problem that this paper attempts to solve is to evaluate and mitigate the risk of discrimination that language models may generate in automated decision - making. With the progress of language model technology, there is a growing interest in applying it to high - risk social decision - making, such as loan approval, housing decisions, etc. However, the potential discrimination problems of these models in these application scenarios raise ethical concerns. Therefore, better methods are needed to evaluate these risks. The paper proposes a method that can prospectively evaluate the potential discriminatory impacts of language models in a wide range of usage scenarios, including hypothetical scenarios where language models have not yet been deployed. Through this method, researchers were able to reveal patterns of positive and negative discrimination in the Claude 2.0 model under certain settings, and demonstrated a technical path by which both types of discrimination can be significantly reduced through carefully designed prompts, providing guidance for the safe deployment of language models.