Abstract: Online conversations can be toxic and subject to threats, abuse, or harassment. To identify toxic text comments, several deep learning and machine learning models have been proposed over the years. However, recent studies demonstrate that, because of imbalances in the training data, some models are more likely to show unintended biases, including gender bias and identity bias. In this research, our aim is to detect toxic comments and reduce unintended bias concerning identity features such as race, gender, sex, and religion by fine-tuning an attention-based model called BERT (Bidirectional Encoder Representations from Transformers). We apply weighted loss to address the issue of unbalanced data and compare the performance of a fine-tuned BERT model with a traditional Logistic Regression model in terms of classification and bias minimization. The Logistic Regression model with the TF-IDF vectorizer achieves 57.1% accuracy, and the fine-tuned BERT model's accuracy is 89%. Code is available at https://github.com/zim10/Determine_Toxic_comment_and_identity_bias.git
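
The abstract mentions a weighted loss for unbalanced data but does not say how the weights are chosen. The sketch below shows one common option, inverse-frequency class weights applied to binary cross-entropy. The weighting formula and the names `class_weights` and `weighted_bce` are assumptions made for illustration, not the authors' implementation.

```python
import math


def class_weights(labels):
    """Inverse-frequency weights (assumed scheme): n / (2 * class_count)."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {0: n / (2 * n_neg), 1: n / (2 * n_pos)}


def weighted_bce(probs, labels, weights, eps=1e-12):
    """Mean binary cross-entropy with a per-class weight on each example."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(labels)


# Toy data: one toxic comment (label 1) among four.
labels = [0, 0, 0, 1]
weights = class_weights(labels)
loss = weighted_bce([0.5, 0.5, 0.5, 0.5], labels, weights)
```

With these weights, a mistake on the rare toxic class costs three times as much as a mistake on the majority class, so the optimizer cannot lower the loss simply by predicting "non-toxic" everywhere.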

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the identification of toxic comments on online social platforms and the minimization of unintentional biases in the classification model. Specifically, the research aims to:

1. **Detect toxic comments**: Identify comments of a threatening, abusive or harassing nature that appear on social media platforms.
2. **Reduce unintentional biases related to identity characteristics**: Reduce the biases generated by the model when processing comments related to identity characteristics such as race, gender, sexual orientation, religion, etc.

### Background problems

Recent research shows that, due to the imbalance in training data, some models are more likely to show unintentional biases, especially biases against "identity terms". These terms refer to groups of people with specific demographic characteristics (such as race, religion, gender, etc.). For example, some harmless comments (such as "I am a proud woman" or "I am a black man") may be misclassified as toxic because the model has become overly sensitive to certain keywords (such as "black", "Muslim", "female", etc.) during training.

### Research objectives

To address the above problems, the main contributions of this research include:

- **Exploratory data analysis (EDA)**: Understand the trends, patterns and relationships between variables in the dataset.
- **Train BERT and logistic regression models**: Classify toxic comments by fine-tuning the BERT model and by using a logistic regression model with a TF-IDF vectorizer, and reduce identity biases.
- **Analyze and compare model performance**: Evaluate and compare the trained models in terms of classification performance and bias minimization.

### Research questions

1. **RQ1**: Which metrics are most suitable for measuring the bias-minimization performance of the logistic regression and fine-tuned BERT models? Why?
   - Motivation: In addition to classifying toxicity, the proposed fine-tuned BERT method should also reduce identity bias for common non-toxic identity mentions. An ideal model should give consistent toxicity scores for comments from different identity groups.
2. **RQ2**: How do the logistic regression and fine-tuned BERT models perform in reducing unintentional biases against certain identity terms?
   - Motivation: Evaluate the effectiveness of the models in detecting toxicity in text conversations and reducing unintentional biases after training.

### Methods

- **Dataset**: The dataset obtained from Kaggle contains 1.8 million comments and 45 features, including comment text, toxicity labels, and identity-related subtype attributes.
- **Model architecture**:
  - **Logistic regression model**: Use a TF-IDF vectorizer for text feature extraction and optimize hyperparameters through cross-validation.
  - **BERT model**: Fine-tune the pre-trained BERT model to capture complex language patterns and context information.
- **Experimental setup**: Conduct experiments using Python and its related libraries (such as PyTorch, NumPy, scikit-learn, etc.) and use the Google Colab hardware environment for additional computing power.
- **Evaluation metrics**:
  - **Classification performance**: Use the overall AUC (Area Under the Curve) to measure the overall classification performance of the model.
  - **Bias-minimization performance**: Use Subgroup AUC, BPSN AUC, BNSP AUC and the generalized mean of AUCs to evaluate the bias-minimization performance of the model across identity subgroups.

### Experimental results

The experiments show that the fine-tuned BERT model significantly outperforms the traditional logistic regression model in classification performance, with an accuracy of 89% compared with 57.1% for logistic regression. The fine-tuned BERT model also performs better on bias minimization, especially in reducing unintentional biases when processing comments related to identity characteristics.

In conclusion, this research improves the accuracy of toxic comment classification and reduces the identity biases of the model by introducing the fine-tuned BERT model and combining it with appropriate evaluation metrics.
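
The bias metrics named under Methods can be sketched in plain Python as follows. This is a minimal illustration, not the paper's evaluation code: it uses the standard definitions of Subgroup AUC, BPSN (Background Positive, Subgroup Negative) AUC and BNSP (Background Negative, Subgroup Positive) AUC. The function names and the default power `p = -5` for the generalized mean are assumptions; the summary does not state which value the authors used, and -5 is the value from the Jigsaw Unintended Bias competition on Kaggle.

```python
def auc(scores, labels):
    """Chance that a random toxic example outscores a random non-toxic one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")  # AUC is undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def _select(rows, keep):
    """Return (scores, labels) for the rows that satisfy `keep`."""
    picked = [r for r in rows if keep(r)]
    return [r[0] for r in picked], [r[1] for r in picked]


def bias_aucs(scores, labels, in_subgroup):
    """Compute the three per-identity bias AUCs for one identity subgroup."""
    rows = list(zip(scores, labels, in_subgroup))
    # Subgroup AUC: only comments that mention the identity.
    subgroup = _select(rows, lambda r: r[2])
    # BPSN: non-toxic comments in the subgroup + toxic comments outside it.
    bpsn = _select(rows, lambda r: (r[2] and r[1] == 0) or (not r[2] and r[1] == 1))
    # BNSP: toxic comments in the subgroup + non-toxic comments outside it.
    bnsp = _select(rows, lambda r: (r[2] and r[1] == 1) or (not r[2] and r[1] == 0))
    return {
        "subgroup_auc": auc(*subgroup),
        "bpsn_auc": auc(*bpsn),
        "bnsp_auc": auc(*bnsp),
    }


def generalized_mean(values, p=-5):
    """Power mean; a negative p pulls the result toward the worst subgroup."""
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)


# Toy example: three comments mention the identity, three do not.
scores = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]
labels = [1, 0, 0, 1, 1, 0]
in_subgroup = [True, True, True, False, False, False]
result = bias_aucs(scores, labels, in_subgroup)
```

In the toy data, the BPSN AUC is low because a non-toxic comment that mentions the identity receives a higher score than the toxic comments that do not, which is the false-positive pattern described in the background section.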

Determination of toxic comments and unintended model bias minimization using Deep learning approach
