Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems

Emma Harvey, Emily Sheng, Su Lin Blodgett, Alexandra Chouldechova, Jean Garcia-Gathright, Alexandra Olteanu, Hanna Wallach
2024-11-24
Abstract: To facilitate the measurement of representational harms caused by large language model (LLM)-based systems, the NLP research community has produced and made publicly available numerous measurement instruments, including tools, datasets, metrics, benchmarks, annotation instructions, and other techniques. However, the research community lacks clarity about whether and to what extent these instruments meet the needs of practitioners tasked with developing and deploying LLM-based systems in the real world, and how these instruments could be improved. Via a series of semi-structured interviews with practitioners in a variety of roles in different organizations, we identify four types of challenges that prevent practitioners from effectively using publicly available instruments for measuring representational harms caused by LLM-based systems: (1) challenges related to using publicly available measurement instruments; (2) challenges related to doing measurement in practice; (3) challenges arising from measurement tasks involving LLM-based systems; and (4) challenges specific to measuring representational harms. Our goal is to advance the development of instruments for measuring representational harms that are well-suited to practitioner needs, thus better facilitating the responsible development and deployment of LLM-based systems.
Computers and Society
What problem does this paper attempt to address?
The main problem this paper addresses is **the challenges practitioners face when measuring representational harms caused by large language model (LLM)-based systems in practice**. Specifically, the paper asks whether existing publicly available measurement instruments (tools, datasets, metrics, benchmarks, and so on) meet the needs of practitioners who develop and deploy LLM-based systems, and how these instruments could be improved to better support responsible development and deployment.

### Detailed Interpretation

1. **Research Background**:
   - As LLM-based systems are widely deployed, the representational harms they may cause have received increasing attention. Representational harms occur when a system portrays certain social groups less favorably than others, for example by demeaning or ignoring them.
   - The NLP research community has developed and publicly released many tools, datasets, metrics, and other instruments for measuring representational harms. However, it remains unclear whether these instruments actually suit practitioners in real application scenarios, and how they could be improved.

2. **Research Purpose**:
   - Through a series of semi-structured interviews, the paper identifies four categories of challenges that prevent practitioners from effectively using existing publicly available measurement instruments:
     1. **Challenges in Using Publicly Available Measurement Instruments**: For example, concerns about the validity and specificity of the instruments.
     2. **Challenges in Doing Measurement in Practice**: For example, constraints encountered when measuring within real products and services.
     3. **Challenges Arising from Measurement Tasks Involving LLM-Based Systems**: For example, the difficulty of evaluating a model's true performance when its training data is unknown.
     4. **Challenges Specific to Measuring Representational Harms**: For example, the need for more contextual information and social science expertise.

3. **Research Method**:
   - The authors conducted semi-structured interviews with 12 practitioners from different organizations about the problems and challenges they encounter in their work. The interviews covered the instruments they use, the difficulties they encounter, and their views on existing instruments.

4. **Research Results**:
   - The study found that practitioners mainly face the following challenges when using publicly available measurement instruments:
     - **Validity Issues**: Whether the instrument actually measures what it is intended to measure.
     - **Specificity Issues**: Whether the instrument is specific enough to a particular system, use case, and deployment context.
     - **Limitations in Practice**: For example, constraints on time and resources, and company policies.
     - **Special Characteristics of LLM-Based Systems**: For example, the uncertainty introduced by unknown training data.
     - **Uniqueness of Representational Harms**: Compared with other types of harms, measuring representational harms requires more contextual information and social science knowledge.

5. **Future Work Directions**:
   - Future research should explore these issues further and draw on measurement theory and practical measurement methods from the social sciences to improve existing instruments so that they better match practitioners' actual needs.
   - It should also explore how to increase practitioners' adoption of publicly available measurement instruments, in order to promote the responsible development and deployment of LLM-based systems.

By addressing these problems, the paper aims to bridge the gap between research and practice and to promote the development and use of more effective instruments for measuring representational harms.