A Simple Remedy for Dataset Bias via Self-Influence: A Mislabeled Sample Perspective

Yeonsung Jung, Jaeyun Song, June Yong Yang, Jin-Hwa Kim, Sung-Yub Kim, Eunho Yang
2024-11-01
Abstract: Learning generalized models from biased data is an important undertaking toward fairness in deep learning. To address this issue, recent studies attempt to identify and leverage bias-conflicting samples free from spurious correlations without prior knowledge of bias or an unbiased set. However, spurious correlation remains an ongoing challenge, primarily due to the difficulty in precisely detecting these samples. In this paper, inspired by the similarities between mislabeled samples and bias-conflicting samples, we approach this challenge from a novel perspective of mislabeled sample detection. Specifically, we delve into Influence Function, one of the standard methods for mislabeled sample detection, for identifying bias-conflicting samples and propose a simple yet effective remedy for biased models by leveraging them. Through comprehensive analysis and experiments on diverse datasets, we demonstrate that our new perspective can boost the precision of detection and rectify biased models effectively. Furthermore, our approach is complementary to existing methods, showing performance improvement even when applied to models that have already undergone recent debiasing techniques.
Machine Learning, Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the impact of **dataset bias** on deep-learning models, in particular how to identify and use bias-conflicting samples without prior knowledge of the bias. The authors focus on preventing a model from relying on spurious correlations in biased training data so that it learns task-related features instead, which improves both generalization and fairness.

#### Background problems

1. **The impact of dataset bias**:
   - In real-world datasets, task-irrelevant attributes (such as background or color) may be spuriously correlated with labels.
   - A model may rely on these spurious correlations rather than on task-related features, resulting in poor performance on unseen data.
2. **Limitations of existing methods**:
   - Existing methods try to alleviate this problem by identifying and using bias-conflicting samples (samples free from the spurious correlation).
   - However, these methods struggle to detect bias-conflicting samples accurately, and when detection is imprecise they may amplify harmful biases instead of task-related features.

#### The paper's solution

To solve these problems, the authors apply the idea of **mislabeled sample detection** to bias-conflicting sample detection. The specific steps are as follows:

1. **Introducing Self-Influence (SI)**:
   - Self-Influence is a technique for detecting mislabeled samples. It estimates how removing a specific training sample would affect the model's prediction on that same sample.
   - The authors find that bias-conflicting samples can behave like mislabeled samples under certain conditions.
2. **Bias-Conditioned Self-Influence (BCSI)**:
   - To detect bias-conflicting samples more effectively, the authors propose Bias-Conditioned Self-Influence (BCSI).
   - BCSI separates bias-conflicting samples better by restricting the model from learning task-related features, so that it relies more heavily on the harmful bias.
3. **Constructing a critical subset and fine-tuning**:
   - Use BCSI to construct a small critical subset that contains a high proportion of bias-conflicting samples.
   - Use this critical subset to fine-tune the model and correct the bias, without relying on bias information or an unbiased validation set.

#### Main contributions

1. **Reveal the conditions under which self-influence is effective on biased datasets** and propose Bias-Conditioned Self-Influence (BCSI).
2. **Propose a simple and effective method** that corrects a biased model by fine-tuning it on the critical subset.
3. **Show that the method is complementary to existing methods**: it further improves performance even on models to which other debiasing techniques have already been applied.

Through these contributions, the authors aim to handle dataset bias more effectively without prior knowledge, thereby improving the fairness and generalization ability of deep-learning models.
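
To make the Self-Influence step more concrete, the sketch below computes the standard influence-function form of self-influence, SI(z) = g(z)^T H^{-1} g(z), where g(z) is the loss gradient of a training sample and H is the Hessian of the training loss. It is a minimal illustration under assumptions, not the paper's implementation: it uses a small L2-regularized logistic regression in NumPy instead of a deep network, it does not include the bias-conditioning that distinguishes BCSI, and the names `fit_logreg`, `self_influence`, and `top_k_by_self_influence` are hypothetical helpers introduced here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fit_logreg(X, y, l2=1e-2, lr=0.5, steps=500):
    """Fit L2-regularized logistic regression with plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y) + l2 * w
        w -= lr * grad
    return w


def self_influence(X, y, w, l2=1e-2):
    """Return g_i^T H^{-1} g_i for every training sample i.

    g_i is the gradient of sample i's logistic loss with respect to w,
    and H is the Hessian of the mean loss plus the L2 term (assumed
    setup; the L2 term keeps H positive definite and invertible).
    """
    p = sigmoid(X @ w)
    G = (p - y)[:, None] * X                      # per-sample gradients, (n, d)
    H = (X * (p * (1.0 - p))[:, None]).T @ X / len(y)
    H += l2 * np.eye(X.shape[1])
    HinvG = np.linalg.solve(H, G.T).T             # rows are H^{-1} g_i
    return np.einsum("ij,ij->i", G, HinvG)


def top_k_by_self_influence(scores, k):
    """Indices of the k samples with the highest self-influence."""
    return np.argsort(scores)[::-1][:k]


# Toy usage: two Gaussian classes with a few deliberately flipped labels.
rng = np.random.default_rng(0)
n = 200
y_clean = (np.arange(n) % 2).astype(float)
X = rng.normal(size=(n, 2))
X[:, 0] += 4.0 * y_clean - 2.0                    # class means at -2 and +2
y = y_clean.copy()
flipped = np.arange(10)
y[flipped] = 1.0 - y[flipped]                     # simulate mislabeled samples

w = fit_logreg(X, y)
scores = self_influence(X, y, w)
candidates = top_k_by_self_influence(scores, k=10)
print("flipped samples among top-10:", np.intersect1d(candidates, flipped).size)
```

In this toy setting the flipped samples stand in for the mislabeled (or, by the paper's analogy, bias-conflicting) samples. The highest-scoring indices play the role of the small subset that would then be used for fine-tuning. How that subset is selected and used in the actual method should be taken from the paper itself.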