Understanding Bias in Large-Scale Visual Datasets

Boya Zeng, Yida Yin, Zhuang Liu
2024-12-03
Abstract: A recent study has shown that large-scale visual datasets are very biased: they can be easily classified by modern neural networks. However, the concrete forms of bias among these datasets remain unclear. In this study, we propose a framework to identify the unique visual attributes distinguishing these datasets. Our approach applies various transformations to extract semantic, structural, boundary, color, and frequency information from datasets, and assesses how much each type of information reflects their bias. We further decompose their semantic bias with object-level analysis, and leverage natural language methods to generate detailed, open-ended descriptions of each dataset's characteristics. Our work aims to help researchers understand the bias in existing large-scale pre-training datasets, and build more diverse and representative ones in the future. Our project page and code are available at http://boyazeng.github.io/understand_bias.
Computer Vision and Pattern Recognition, Machine Learning
### What problem does this paper attempt to address?

This paper investigates bias in large-scale visual datasets. Modern neural networks can easily tell which dataset an image comes from, yet the concrete forms of bias behind this remain unclear. Understanding these biases is crucial for improving the diversity and representativeness of datasets, so that future visual datasets better reflect the real world.

#### Main research questions:

1. **Identifying specific forms of bias**: The paper proposes a framework for identifying the unique visual attributes that distinguish large-scale visual datasets. Various transformations extract semantic, structural, boundary, color, and frequency information, and the contribution of each type of information to dataset bias is evaluated.
2. **Semantic bias analysis**: Semantic bias is further decomposed through object-level analysis, and natural language methods generate detailed, open-ended descriptions of each dataset's characteristics.
3. **Bias manifestations at different information levels**: Image transformations (such as semantic segmentation, object detection, image captioning, and variational autoencoders) are used to quantify how much each type of visual information contributes to the dataset classification task.
4. **Bias inheritance in synthetic images**: The paper evaluates whether synthetic images generated by diffusion models inherit the biases in their training data.

#### Research background:

- **Existing problems**: Earlier studies showed that datasets from a decade ago had obvious biases, and modern neural networks can still classify the latest large-scale datasets (such as YFCC, CC, and DataComp) with high accuracy. This indicates that these datasets still carry significant biases.
- **Research motivation**: Understanding the specific forms of these biases helps researchers develop more diverse and representative datasets, and thereby build truly general-purpose visual systems that operate reliably across scenarios.

#### Research objectives:

- **Understanding bias**: Reveal the specific forms of bias in large-scale visual datasets through various transformations and analyses.
- **Improving dataset construction**: Provide guidance for creating future datasets so that they are more diverse and representative.
- **Enhancing the generalization ability of visual systems**: Reduce dataset bias so that visual systems perform better in practical applications.

Through these studies, the authors hope to help researchers better understand the biases in existing large-scale pre-training datasets and build more diverse and representative datasets in the future.
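
To make the idea of measuring bias through a transformation more concrete, the following is a minimal sketch, not the authors' implementation. It assumes one possible color-only transformation (collapsing each image to its mean RGB value) and a nearest-centroid classifier as a stand-in for a neural network. The names `mean_rgb`, `fit_centroids`, `predict`, `toy_image`, and `dataset_classification_accuracy`, as well as the two tinted toy "datasets", are hypothetical and exist only for illustration.

```python
import random
from statistics import mean


def mean_rgb(image):
    """Collapse an image (rows of (r, g, b) pixels) to its per-channel mean.

    This discards all spatial and semantic content and keeps only color.
    """
    pixels = [px for row in image for px in row]
    return tuple(mean(px[c] for px in pixels) for c in range(3))


def fit_centroids(features, labels):
    """Average the feature vectors belonging to each dataset label."""
    centroids = {}
    for label in set(labels):
        members = [f for f, l in zip(features, labels) if l == label]
        centroids[label] = tuple(mean(f[c] for f in members) for c in range(3))
    return centroids


def predict(centroids, feature):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], feature))
    return min(centroids, key=sq_dist)


def toy_image(rng, tint, size=4):
    """Hypothetical stand-in for a real image: bounded noise around a tint."""
    return [
        [tuple(min(255, max(0, t + rng.randint(-20, 20))) for t in tint)
         for _ in range(size)]
        for _ in range(size)
    ]


def dataset_classification_accuracy(seed=0, n_per_split=50):
    """Train and evaluate a dataset classifier on color-only features."""
    rng = random.Random(seed)
    # Assumption: two toy datasets that differ only in their average tint.
    tints = {"dataset_a": (150, 120, 100), "dataset_b": (100, 120, 150)}
    train, test = [], []
    for label, tint in tints.items():
        train += [(mean_rgb(toy_image(rng, tint)), label) for _ in range(n_per_split)]
        test += [(mean_rgb(toy_image(rng, tint)), label) for _ in range(n_per_split)]
    centroids = fit_centroids([f for f, _ in train], [l for _, l in train])
    correct = sum(predict(centroids, f) == l for f, l in test)
    return correct / len(test)


if __name__ == "__main__":
    print(f"color-only dataset classification accuracy: "
          f"{dataset_classification_accuracy():.2f}")
```

With two datasets, chance accuracy is 0.5. Accuracy well above that on the transformed inputs suggests that the retained information (here, color alone) already carries dataset-specific bias. The same pattern would apply to other transformations by swapping out `mean_rgb`.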