NLPGuard: A Framework for Mitigating the Use of Protected Attributes by NLP Classifiers

Salvatore Greco, Ke Zhou, Licia Capra, Tania Cerquitelli, Daniele Quercia
2024-07-02
Abstract: AI regulations are expected to prohibit machine learning models from using sensitive attributes during training. However, the latest Natural Language Processing (NLP) classifiers, which rely on deep learning, operate as black-box systems, complicating the detection and remediation of such misuse. Traditional bias mitigation methods in NLP aim for comparable performance across different groups based on attributes like gender or race but fail to address the underlying issue of reliance on protected attributes. To partly fix that, we introduce NLPGuard, a framework for mitigating the reliance on protected attributes in NLP classifiers. NLPGuard takes an unlabeled dataset, an existing NLP classifier, and its training data as input, producing a modified training dataset that significantly reduces dependence on protected attributes without compromising accuracy. NLPGuard is applied to three classification tasks: identifying toxic language, sentiment analysis, and occupation classification. Our evaluation shows that current NLP classifiers heavily depend on protected attributes, with up to $23\%$ of the most predictive words associated with these attributes. However, NLPGuard effectively reduces this reliance by up to $79\%$, while slightly improving accuracy.
Computation and Language, Artificial Intelligence, Human-Computer Interaction
What problem does this paper attempt to address?
The paper addresses the over-reliance of state-of-the-art Natural Language Processing (NLP) classifiers on protected attributes (such as race, gender, and sexual orientation) when making predictions. Such reliance is expected to conflict with upcoming AI regulations, and it can also make the model's decisions unfair. Because these classifiers operate as black-box systems, detecting and mitigating this misuse is difficult. Existing bias mitigation methods aim for comparable performance across groups, but they do not address the model's underlying dependence on protected attributes.

To address this, the paper proposes NLPGuard, a framework that reduces the dependence of NLP classifiers on protected attributes without sacrificing accuracy. NLPGuard has three components, applied in sequence:

1. **Explainer**: Uses Explainable Artificial Intelligence (XAI) techniques to identify the words that matter most to the model's predictions.
2. **Identifier**: Determines which of these important words are related to protected attributes.
3. **Moderator**: Modifies the training data so that the retrained NLP model depends less on protected attributes.

The paper validates NLPGuard on three tasks: toxic language detection, sentiment analysis, and occupation classification. Across these tasks, NLPGuard reduces the classifiers' dependence on protected attributes while maintaining or slightly improving accuracy.
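The three-step flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything here is an assumption made for illustration: the toy scoring function `toy_score`, the hand-written `PROTECTED_LEXICON`, the use of word occlusion in place of a real XAI technique, a plain lexicon lookup in place of the paper's identification step, and word removal as the moderation strategy.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Set

# Hypothetical toy classifier: integer "points" toward the label "nurse".
TOY_WEIGHTS: Dict[str, int] = {"patients": 30, "care": 30, "she": 30, "her": 20}

# Hypothetical lexicon of words tied to a protected attribute (gender).
PROTECTED_LEXICON: Set[str] = {"she", "he", "her", "his"}


def toy_score(tokens: List[str]) -> int:
    """Score in [0, 100] that a bio describes a nurse (illustrative only)."""
    return min(100, sum(TOY_WEIGHTS.get(t, 0) for t in tokens))


def explain(score: Callable[[List[str]], int], texts: List[str], top_k: int) -> List[str]:
    """Explainer stand-in: rank words by the mean score drop when occluded."""
    total_drop: Dict[str, float] = defaultdict(float)
    seen_in: Counter = Counter()
    for text in texts:
        tokens = text.lower().split()
        base = score(tokens)
        for word in set(tokens):
            without = [t for t in tokens if t != word]
            total_drop[word] += base - score(without)
            seen_in[word] += 1
    mean_drop = {w: total_drop[w] / seen_in[w] for w in total_drop}
    # Sort by importance (descending), breaking ties alphabetically.
    return sorted(mean_drop, key=lambda w: (-mean_drop[w], w))[:top_k]


def identify(words: List[str], lexicon: Set[str]) -> Set[str]:
    """Identifier stand-in: keep the important words found in the lexicon."""
    return {w for w in words if w in lexicon}


def moderate(training_texts: List[str], protected: Set[str]) -> List[str]:
    """Moderator stand-in: drop the flagged words from each training text."""
    return [
        " ".join(t for t in text.split() if t.lower() not in protected)
        for text in training_texts
    ]


unlabeled = [
    "she provides care to patients",
    "her patients value her care",
    "he leads the surgery team",
]
training = [
    "She provides care to patients",
    "Her patients value her care",
    "He leads the surgery team",
]

important = explain(toy_score, unlabeled, top_k=4)
protected = identify(important, PROTECTED_LEXICON)
moderated = moderate(training, protected)

print(important)
print(sorted(protected))
print(moderated)
```

In this sketch only protected words that also rank among the most important words are removed, which is why "He" stays in the third training text: the toy classifier never relied on it. A real pipeline would then retrain the classifier on the moderated texts and compare both accuracy and the share of protected words among the top-ranked words.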