Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models

Yang Liu

2024-01-22

Abstract: Many evaluation measures are used to evaluate social biases in masked language models (MLMs). However, we find that these previously proposed evaluation measures lack robustness in scenarios with limited datasets. This is because these measures are obtained by comparing the pseudo-log-likelihood (PLL) scores of the stereotypical and anti-stereotypical samples using an indicator function. The disadvantage is that this makes only limited use of the PLL score sets and does not capture their distributional information. In this paper, we represent a PLL score set as a Gaussian distribution and use Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence to construct evaluation measures for the distributions of stereotypical and anti-stereotypical PLL scores. Experimental results on the publicly available datasets StereoSet (SS) and CrowS-Pairs (CP) show that our proposed measures are significantly more robust and interpretable than those proposed previously.

Computation and Language

What problem does this paper attempt to address?

This paper focuses on improving the methods used to evaluate social bias in pre-trained language models, particularly Masked Language Models (MLMs). The authors point out that existing evaluation measures lack robustness when the dataset is limited, and they propose new evaluation measures to address this issue. Specifically, the paper makes the following points:

1. **Problems with existing methods**: Current methods for evaluating social bias are primarily based on Pseudo-Log-Likelihood (PLL) scores. They determine the degree of bias by comparing the PLL scores of stereotypical samples with those of anti-stereotypical samples using indicator functions. This approach has several limitations:
   - It fails to adequately consider the distributional information of the PLL scores;
   - It is not robust to some pitfalls that may exist in the dataset;
   - The accuracy of the evaluation suffers when the data volume is small.
2. **Proposed new method**: The authors propose representing each set of PLL scores as a Gaussian distribution and using Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence to construct new evaluation metrics. This method can better reflect the overall distributional characteristics of the PLL scores, improving the robustness and interpretability of the evaluation.
3. **Experimental results**: In experiments on the publicly available datasets StereoSet (SS) and CrowS-Pairs (CP), the newly proposed evaluation metrics are more robust and interpretable than previous methods.

In summary, this paper aims to improve the evaluation of social bias in masked language models by introducing new evaluation metrics that overcome the limitations of existing methods and make the evaluation more robust and reliable.
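The contrast between an indicator-based measure and a distribution-based one can be sketched in Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy PLL values, the use of the population standard deviation, and the numerical integration for the JS divergence (which has no closed form between two Gaussians) are all choices made here for the example.

```python
import math
from statistics import fmean, pstdev


def indicator_score(stereo_pll, anti_pll):
    """Percentage of pairs where the stereotypical sentence has the higher PLL."""
    wins = sum(1 for s, a in zip(stereo_pll, anti_pll) if s > a)
    return 100.0 * wins / len(stereo_pll)


def fit_gaussian(scores):
    """Summarize a PLL score set as (mean, std). Population std is an assumption.
    The std must be non-zero for the divergences below to be defined."""
    return fmean(scores), pstdev(scores)


def kl_gaussian(p, q):
    """Closed-form KL(P || Q) for univariate Gaussians given as (mean, std)."""
    (m1, s1), (m2, s2) = p, q
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5


def normal_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))


def js_gaussian(p, q, steps=20000, width=10.0):
    """JS divergence between two Gaussians by midpoint-rule integration.
    The grid size and integration range are illustrative choices."""
    (m1, s1), (m2, s2) = p, q
    lo = min(m1 - width * s1, m2 - width * s2)
    hi = max(m1 + width * s1, m2 + width * s2)
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        px, qx = normal_pdf(x, m1, s1), normal_pdf(x, m2, s2)
        mx = 0.5 * (px + qx)  # mixture density M = (P + Q) / 2
        if mx <= 0.0:
            continue  # both densities underflowed; contribution is negligible
        if px > 0.0:
            total += 0.5 * px * math.log(px / mx) * dx
        if qx > 0.0:
            total += 0.5 * qx * math.log(qx / mx) * dx
    return total


# Made-up PLL scores for four sentence pairs (hypothetical, not from SS or CP).
stereo = [-12.1, -15.3, -9.8, -20.4]
anti = [-13.0, -14.9, -11.2, -22.0]

p, q = fit_gaussian(stereo), fit_gaussian(anti)
print("indicator-based score (%):", indicator_score(stereo, anti))
print("KL(stereo || anti):", kl_gaussian(p, q))
print("JS(stereo, anti):", js_gaussian(p, q))
```

The indicator-based score only records which sentence in each pair wins, so a single flipped pair in a small dataset moves it by a large step. The divergences depend on the fitted means and variances instead. KL is asymmetric, whereas JS is symmetric and bounded above by ln 2 when natural logarithms are used.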

Constructing Holistic Measures for Social Biases in Masked Language Models

What is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models

Measuring Social Biases in Masked Language Models by Proxy of Prediction Quality

Evaluating Short-Term Temporal Fluctuations of Social Biases in Social Media Data and Masked Language Models

Gender Bias in Masked Language Models for Multiple Languages

Bias Against 93 Stigmatized Groups in Masked Language Models and Downstream Sentiment Classification Tasks

A Better Way to Do Masked Language Model Scoring

FairPair: A Robust Evaluation of Biases in Language Models through Paired Perturbations

Assessing gender bias in medical and scientific masked language models with StereoSet

Prejudice and Volatility: A Statistical Framework for Measuring Social Discrimination in Large Language Models

Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings

The Mismeasure of Man and Models: Evaluating Allocational Harms in Large Language Models

Measuring Implicit Bias in Explicitly Unbiased Large Language Models

Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models

Ask LLMs Directly, "What shapes your bias?": Measuring Social Bias in Large Language Models

A Predictive Factor Analysis of Social Biases and Task-Performance in Pretrained Masked Language Models

As Biased as You Measure: Methodological Pitfalls of Bias Evaluations in Speaker Verification Research

Different Bias Under Different Criteria: Assessing Bias in LLMs with a Fact-Based Approach

Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models