Determination of toxic comments and unintended model bias minimization using Deep learning approach

Md Azim Khan
2023-11-09
Abstract: Online conversations can be toxic and subject to threats, abuse, or harassment. To identify toxic text comments, several deep learning and machine learning models have been proposed over the years. However, recent studies demonstrate that, because of imbalances in the training data, some models are more likely to show unintended biases, including gender bias and identity bias. In this research, our aim is to detect toxic comments and reduce unintended bias concerning identity features such as race, gender, sex, and religion by fine-tuning an attention-based model called BERT (Bidirectional Encoder Representations from Transformers). We apply a weighted loss to address the issue of unbalanced data and compare the performance of a fine-tuned BERT model with a traditional logistic regression model in terms of classification and bias minimization. The logistic regression model with the TF-IDF vectorizer achieves 57.1% accuracy, and the fine-tuned BERT model's accuracy is 89%. Code is available at https://github.com/zim10/Determine_Toxic_comment_and_identity_bias.git
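The abstract mentions a weighted loss for the unbalanced data but does not say how the weights are chosen. As a minimal sketch, assuming inverse-frequency class weights and a binary cross-entropy loss (both assumptions, as are the helper names `class_weights` and `weighted_bce` and the toy data), the idea can be illustrated like this:

```python
import math


def class_weights(labels):
    """Inverse-frequency weights (an assumed scheme): n / (2 * class count).

    The rarer class receives the larger weight.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {0: n / (2 * n_neg), 1: n / (2 * n_pos)}


def weighted_bce(labels, probs, weights, eps=1e-12):
    """Mean binary cross-entropy with a per-class weight on each example."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(labels)


# Toy data (hypothetical): 7 non-toxic comments and 1 toxic comment.
labels = [0, 0, 0, 0, 0, 0, 0, 1]
# A lazy classifier that gives every comment a low toxicity probability.
probs = [0.1] * len(labels)

weights = class_weights(labels)
unweighted = weighted_bce(labels, probs, {0: 1.0, 1: 1.0})
weighted = weighted_bce(labels, probs, weights)
print(weights, unweighted, weighted)
```

With the weights applied, missing the single toxic comment costs more than it does under the unweighted loss, so a model can no longer reach a low loss by predicting "non-toxic" everywhere. In PyTorch, a comparable effect is commonly obtained with the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`; whether the paper uses that mechanism is not stated here.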
Machine Learning, Computation and Language, Computers and Society
What problem does this paper attempt to address?
The paper addresses two problems: identifying toxic comments on online social platforms and minimizing unintended bias in the models that detect them. Specifically, the research aims to:

1. **Detect toxic comments**: Identify threatening, abusive, or harassing comments that appear on social media platforms.
2. **Reduce unintended bias related to identity characteristics**: Reduce the bias the model shows when processing comments that mention identity characteristics such as race, gender, sexual orientation, or religion.

### Background problems

Recent research shows that, because of imbalance in the training data, some models are more likely to show unintended bias, especially against "identity terms". These terms refer to groups of people with specific demographic characteristics (such as race, religion, or gender). For example, harmless comments such as "I am a proud woman" or "I am a black man" may be misclassified as toxic because the model became overly sensitive to certain keywords (such as "black", "Muslim", or "female") during training.

### Research objectives

To address these problems, the main contributions of this research are:

- **Exploratory data analysis (EDA)**: Understand the trends, patterns, and relationships between variables in the dataset.
- **Fine-tune BERT and fit a logistic regression model**: Classify toxic comments by fine-tuning the BERT model and by using a logistic regression model with a TF-IDF vectorizer, and reduce identity bias.
- **Analyze and compare model performance**: Evaluate and compare the trained models in terms of classification performance and bias minimization.

### Research questions

1. **RQ1**: Which metrics are most suitable for measuring the bias-minimization performance of the logistic regression and fine-tuned BERT models? Why?
   - Motivation: In addition to classifying toxicity, the proposed fine-tuned BERT method should also reduce identity bias for common non-toxic identity mentions. An ideal model should give consistent toxicity scores for comments from different identity groups.
2. **RQ2**: How do the logistic regression and fine-tuned BERT models perform in reducing unintended bias against certain identity terms?
   - Motivation: Evaluate how effectively the trained models detect toxicity in text conversations and reduce unintended bias.

### Methods

- **Dataset**: The dataset, obtained from Kaggle, contains 1.8 million comments and 45 features, including comment text, toxicity labels, and identity-related subtype attributes.
- **Model architecture**:
  - **Logistic regression model**: Uses a TF-IDF vectorizer for text feature extraction, with hyperparameters optimized through cross-validation.
  - **BERT model**: Fine-tuned from the pre-trained BERT model to capture complex language patterns and context information.
- **Experimental setup**: Experiments use Python and related libraries (such as PyTorch, NumPy, and scikit-learn) and run in the Google Colab hardware environment for additional computing power.
- **Evaluation metrics**:
  - **Classification performance**: The overall AUC (Area Under the Curve) measures the overall classification performance of the model.
  - **Bias-minimization performance**: Subgroup AUC, BPSN AUC, BNSP AUC, and the generalized mean of AUCs evaluate the bias-minimization performance of the model across identity subgroups.

### Experimental results

The study found that the fine-tuned BERT model outperforms the traditional logistic regression model in classification, with 89% accuracy versus 57.1%. The fine-tuned BERT model also performs better on bias minimization, especially in reducing unintended bias when processing comments that mention identity characteristics.

In conclusion, by fine-tuning BERT and pairing it with appropriate evaluation metrics, the research improves toxic comment classification accuracy and reduces the model's identity bias relative to the logistic regression baseline.
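
The bias metrics named under Methods can be sketched in plain Python. The summary does not spell out their definitions, so the code below assumes the usual formulation from the Kaggle Jigsaw unintended-bias evaluation: Subgroup AUC is computed only on comments that mention an identity, BPSN AUC on non-toxic comments that mention it plus toxic comments that do not, and BNSP AUC on toxic comments that mention it plus non-toxic comments that do not. The helper names `auc`, `bias_aucs`, and `power_mean`, the toy data, and the exponent `p = -5` (the value used in that Kaggle evaluation, which the paper may or may not follow) are illustrative assumptions.

```python
def auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Returns NaN if a class is missing.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bias_aucs(labels, scores, in_subgroup):
    """Subgroup, BPSN, and BNSP AUC for one identity subgroup."""
    rows = list(zip(labels, scores, in_subgroup))

    def auc_where(keep):
        kept = [(y, s) for y, s, g in rows if keep(y, g)]
        return auc([y for y, _ in kept], [s for _, s in kept])

    return {
        # Only comments that mention the identity.
        "subgroup": auc_where(lambda y, g: g),
        # Background Positive, Subgroup Negative.
        "bpsn": auc_where(lambda y, g: (g and y == 0) or (not g and y == 1)),
        # Background Negative, Subgroup Positive.
        "bnsp": auc_where(lambda y, g: (g and y == 1) or (not g and y == 0)),
    }


def power_mean(values, p=-5.0):
    """Generalized mean; a negative p pulls the result toward the worst value."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


# Toy data (hypothetical): 1 = toxic, 0 = non-toxic.
labels = [1, 0, 1, 0, 1, 0, 0, 1]
in_subgroup = [True, True, True, True, False, False, False, False]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.4]

overall = auc(labels, scores)
result = bias_aucs(labels, scores, in_subgroup)
# Hypothetical per-identity subgroup AUCs combined into one number.
combined = power_mean([0.9, 0.8, 0.6])
print(overall, result, combined)
```

On this toy data the BPSN AUC is the lowest of the three values, which is the pattern described in the background section: non-toxic comments that mention an identity are scored above toxic comments that do not. A generalized mean with negative `p` is a reasonable design choice for combining subgroups because one badly served identity group lowers the combined score more than it would under an arithmetic mean.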