Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models

Myra Cheng, Esin Durmus, Dan Jurafsky
2023-05-30
Abstract: To recognize and mitigate harms from large language models (LLMs), we need to understand the prevalence and nuances of stereotypes in LLM outputs. Toward this end, we present Marked Personas, a prompt-based method to measure stereotypes in LLMs for intersectional demographic groups without any lexicon or data labeling. Grounded in the sociolinguistic concept of markedness (which characterizes explicitly linguistically marked categories versus unmarked defaults), our proposed method is twofold: 1) prompting an LLM to generate personas, i.e., natural language descriptions, of the target demographic group alongside personas of unmarked, default groups; 2) identifying the words that significantly distinguish personas of the target group from corresponding unmarked ones. We find that the portrayals generated by GPT-3.5 and GPT-4 contain higher rates of racial stereotypes than human-written portrayals using the same prompts. The words distinguishing personas of marked (non-white, non-male) groups reflect patterns of othering and exoticizing these demographics. An intersectional lens further reveals tropes that dominate portrayals of marginalized groups, such as tropicalism and the hypersexualization of minoritized women. These representational harms have concerning implications for downstream applications like story generation.
Computation and Language, Artificial Intelligence, Computers and Society
What problem does this paper attempt to address?
The paper addresses social biases and stereotypes in large language models (LLMs) and proposes a new method, Marked Personas, to measure these stereotypes in an unsupervised manner in model-generated descriptions of different demographic groups. The core contributions of the paper are:

1. **Proposing the Marked Personas framework**: This is a prompt-based method that captures patterns and stereotypes in model outputs by having the model generate natural language descriptions of specific demographic groups. The method does not require pre-constructed datasets or lexicons.
2. **Finding that model-generated personas contain more stereotypes**: Personas generated by GPT-3.5 and GPT-4 contain higher rates of racial stereotypes than descriptions written by humans in response to the same prompts.
3. **Analyzing harmful patterns**: The paper provides a detailed analysis of stereotypes, essentializing narratives, clichés, and other harmful patterns in model outputs that the Marked Personas method identifies but existing bias measurement methods do not capture.

The paper first introduces background knowledge, including the sociolinguistic concept of "markedness" and previous methods for measuring bias and stereotypes in language models. It then explains the Marked Personas method in detail, including how personas are generated and how the words that distinguish marked groups from unmarked groups (Marked Words) are identified. Experiments compare model-generated personas with human-written personas and discuss the limitations of existing stereotype lexicons.

Finally, the paper shows that even when model-generated descriptions have a positive emotional tone, they still contain underlying harmful patterns such as othering and essentializing narratives. The paper also examines harmful patterns that are specific to intersectional groups.
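The second step of the method, finding the words that significantly distinguish personas of a marked group from those of the unmarked group, can be illustrated with a short sketch. The summary above does not say which statistic is used, so this example assumes a weighted log-odds ratio with an informative Dirichlet prior (the "Fightin' Words" approach of Monroe et al., 2008), which is a common choice for this kind of comparison. The function name `marked_words`, the whitespace tokenizer, the pooled-count prior, and the 1.96 z-score threshold are all illustrative assumptions rather than details taken from the text.

```python
import math
from collections import Counter


def _tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def marked_words(marked_texts, unmarked_texts, z_threshold=1.96):
    """Return {word: z-score} for words over-represented in marked_texts.

    Assumed statistic: weighted log-odds ratio with an informative
    Dirichlet prior, where the prior is the pooled word counts of
    both corpora.
    """
    marked = Counter(w for t in marked_texts for w in _tokenize(t))
    unmarked = Counter(w for t in unmarked_texts for w in _tokenize(t))
    prior = marked + unmarked

    n_marked = sum(marked.values())
    n_unmarked = sum(unmarked.values())
    n_prior = sum(prior.values())

    scores = {}
    for word, a_w in prior.items():
        y_m, y_u = marked[word], unmarked[word]
        # Smoothed log-odds of the word in each corpus.
        log_odds_m = math.log((y_m + a_w) / (n_marked + n_prior - y_m - a_w))
        log_odds_u = math.log((y_u + a_w) / (n_unmarked + n_prior - y_u - a_w))
        delta = log_odds_m - log_odds_u
        # Approximate variance of the log-odds difference.
        variance = 1.0 / (y_m + a_w) + 1.0 / (y_u + a_w)
        z = delta / math.sqrt(variance)
        if z > z_threshold:
            scores[word] = z
    # Highest-scoring (most distinguishing) words first.
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Calling `marked_words(personas_of_target_group, personas_of_default_group)` on two lists of generated persona strings would return the words that are statistically over-represented in the target group's personas, ordered by z-score. Words that appear at similar rates in both sets receive scores near zero and are filtered out, which is why no lexicon or labeled data is needed.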