Protecting marginalized communities by mitigating discrimination in toxic language detection

Farshid Faal, Ketra Schmitt, Jia Yuan Yu
DOI: https://doi.org/10.1109/istas52410.2021.9629201
2021-10-28
Abstract: As the harms of online toxic language become more apparent, countering toxic behavior online has become an essential application of natural language processing. The first step in managing toxic language risk is identification, but algorithmic approaches to identification have themselves demonstrated bias. Texts containing certain demographic identity terms, such as "gay" or "Black", are more likely to be labeled as toxic in existing toxic language detection datasets. Many machine learning models introduced for toxic language detection assign unreasonably high toxicity scores to non-toxic comments that contain identity terms specific to minority and marginalized communities. To address the challenge of bias in toxic language detection, we propose a two-step training approach. A pretrained language model trained with a multitask learning objective mitigates biases in the toxicity classifier’s predictions. Experiments demonstrate that jointly training the pretrained language model with a multitask objective effectively mitigates the impact of unintended biases and is more robust to model bias towards commonly attacked identity groups present in datasets, without significantly hurting the model’s generalizability.
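The abstract does not spell out the form of the multitask objective. As a rough illustration only, and not the authors' implementation, one common way to set up such an objective is to add an auxiliary loss for predicting identity-term labels to the main toxicity loss. The function names, the `aux_weight` parameter, and the choice of binary cross-entropy below are all assumptions made for this sketch.

```python
import math


def binary_cross_entropy(prob, label, eps=1e-7):
    """Binary cross-entropy for one predicted probability and one 0/1 label."""
    # Clamp the probability so log() never receives 0.
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def multitask_loss(toxicity_prob, toxicity_label,
                   identity_probs, identity_labels, aux_weight=0.5):
    """Hypothetical multitask objective (an assumption, not the paper's formula).

    Main task: toxicity classification.
    Auxiliary task: predicting which identity terms a comment mentions.
    aux_weight is an illustrative hyperparameter, not a value from the paper.
    """
    main = binary_cross_entropy(toxicity_prob, toxicity_label)
    if not identity_probs:
        # No auxiliary labels available: fall back to the main loss alone.
        return main
    aux = sum(
        binary_cross_entropy(p, y)
        for p, y in zip(identity_probs, identity_labels)
    ) / len(identity_probs)
    return main + aux_weight * aux


# Example: a non-toxic comment (label 0.0) that mentions one identity term.
loss = multitask_loss(
    toxicity_prob=0.8, toxicity_label=0.0,
    identity_probs=[0.9, 0.1], identity_labels=[1.0, 0.0],
)
print(loss)
```

In this sketch, a high toxicity score on a non-toxic comment is penalized by the main term, while the auxiliary term asks the shared representation to model identity mentions explicitly rather than conflating them with toxicity.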