Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models

Yang Liu
2024-01-22
Abstract: Many evaluation measures are used to evaluate social biases in masked language models (MLMs). However, we find that these previously proposed evaluation measures lack robustness in scenarios with limited datasets. This is because these measures are obtained by comparing the pseudo-log-likelihood (PLL) scores of the stereotypical and anti-stereotypical samples using an indicator function. The disadvantage is that this comparison makes only limited use of the PLL score sets and does not capture their distributional information. In this paper, we represent a PLL score set as a Gaussian distribution and use Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence to construct evaluation measures for the distributions of stereotypical and anti-stereotypical PLL scores. Experimental results on the publicly available datasets StereoSet (SS) and CrowS-Pairs (CP) show that our proposed measures are significantly more robust and interpretable than those proposed previously.
Computation and Language
What problem does this paper attempt to address?
This paper focuses on improving how social bias is evaluated in pre-trained language models, particularly masked language models (MLMs). The authors point out that existing evaluation measures lack robustness when the dataset is limited, and they propose new measures to address this issue. Specifically, the paper makes the following points:

1. **Problems with existing methods**: Current methods for evaluating social bias are primarily based on pseudo-log-likelihood (PLL) scores. They determine the degree of bias by comparing the PLL scores of stereotypical samples with those of anti-stereotypical samples using an indicator function. This approach has several limitations:
   - It fails to adequately consider the distributional information of the PLL scores;
   - It is not robust to some pitfalls that may exist in the dataset;
   - The accuracy of the evaluation suffers when the data volume is small.
2. **Proposed new method**: The authors propose representing each set of PLL scores as a Gaussian distribution and using Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence to construct new evaluation measures. This method better reflects the overall distributional characteristics of the PLL scores, improving the robustness and interpretability of the evaluation.
3. **Experimental results**: Experiments on the publicly available datasets StereoSet (SS) and CrowS-Pairs (CP) show that the newly proposed measures are more robust and interpretable than previous ones.

In summary, this paper aims to improve the evaluation of social bias in masked language models by introducing new evaluation measures that overcome the limitations of existing methods and make the evaluation more robust and reliable.