Abstract: A recent study has shown that large-scale visual datasets are very biased: they can be easily classified by modern neural networks. However, the concrete forms of bias among these datasets remain unclear. In this study, we propose a framework to identify the unique visual attributes distinguishing these datasets. Our approach applies various transformations to extract semantic, structural, boundary, color, and frequency information from datasets, and assesses how much each type of information reflects their bias. We further decompose their semantic bias with object-level analysis, and leverage natural language methods to generate detailed, open-ended descriptions of each dataset's characteristics. Our work aims to help researchers understand the bias in existing large-scale pre-training datasets, and build more diverse and representative ones in the future. Our project page and code are available at http://boyazeng.github.io/understand_bias.

What problem does this paper attempt to address?

This paper aims to characterize the bias in large-scale visual datasets. Although modern neural networks can easily identify which dataset an image comes from, the specific forms of bias in these datasets are still unclear. Understanding these biases is crucial for improving the diversity and representativeness of datasets, and thus for constructing more comprehensive visual datasets that better reflect the real world.

#### Main research questions:

1. **Identifying specific forms of bias**: The paper proposes a framework for identifying the unique visual attributes that distinguish large-scale visual datasets. Various transformations extract semantic, structural, boundary, color, and frequency information, and the framework evaluates how much each type of information reflects dataset bias.
2. **Semantic bias analysis**: Semantic bias is further decomposed through object-level analysis, and natural language methods are used to generate detailed, open-ended descriptions of each dataset's characteristics.
3. **Bias manifestations at different information levels**: Image transformations (such as semantic segmentation, object detection, image captioning, and variational autoencoders) are used to quantify how much each type of visual information contributes to the dataset classification task.
4. **Bias inheritance in synthetic images**: The paper evaluates whether synthetic images generated by diffusion models inherit the biases in their training data.

#### Research background:

- **Existing problems**: Early studies showed that datasets from a decade ago had obvious biases, and modern neural networks can still classify the latest large-scale datasets (such as YFCC, CC, and DataComp) with high accuracy. This indicates that these datasets still have significant internal biases.
- **Research motivation**: Understanding the specific forms of these biases helps researchers develop more diverse and representative datasets, and thereby build truly general-purpose visual systems that operate reliably in various scenarios.

#### Research objectives:

- **Understanding bias**: Reveal the specific forms of bias in large-scale visual datasets through various transformations.
- **Improving dataset construction**: Provide guidance for creating future datasets that are more diverse and representative.
- **Enhancing the generalization ability of visual systems**: Reduce the bias in datasets so that visual systems perform better in practical applications.

Through this work, the authors hope to help researchers better understand the biases in existing large-scale pre-training datasets and build more diverse and representative datasets in the future.
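
The idea of measuring bias through dataset classification on transformed images can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the authors' pipeline: the two "datasets" are synthetic noise images that differ only by a color tint, the transformations (`mean_color`, `to_grayscale`) are simple stand-ins for the paper's color and structure-oriented transformations, and a nearest-centroid classifier replaces the neural network. If the classifier recovers the dataset of origin well above chance from one transformed view but not from another, the bias lies in the information the first view preserves.

```python
import numpy as np

def mean_color(images):
    """Color-only view: reduce each HxWx3 image to its mean RGB value."""
    return images.mean(axis=(1, 2))

def to_grayscale(images):
    """Color-free view: average the channels, then flatten each image."""
    return images.mean(axis=3).reshape(len(images), -1)

def dataset_classification_accuracy(features, labels, rng, train_frac=0.5):
    """Fit a nearest-centroid classifier to predict the dataset of origin
    and return its accuracy on a held-out split."""
    order = rng.permutation(len(labels))
    split = int(train_frac * len(labels))
    train, test = order[:split], order[split:]
    classes = np.unique(labels)
    centroids = np.stack(
        [features[train][labels[train] == c].mean(axis=0) for c in classes]
    )
    # Distance from every test sample to every class centroid.
    dists = np.linalg.norm(
        features[test][:, None, :] - centroids[None, :, :], axis=2
    )
    preds = classes[dists.argmin(axis=1)]
    return float((preds == labels[test]).mean())

rng = np.random.default_rng(0)
n_per_dataset = 200

# Hypothetical datasets: uniform-noise 8x8 RGB images. Dataset 1 gets a tint
# that raises red and lowers blue by the same amount, so the channel average
# (the grayscale view) is statistically unchanged.
dataset_a = rng.random((n_per_dataset, 8, 8, 3))
dataset_b = rng.random((n_per_dataset, 8, 8, 3)) + np.array([0.2, 0.0, -0.2])

images = np.concatenate([dataset_a, dataset_b])
labels = np.concatenate([np.zeros(n_per_dataset, int), np.ones(n_per_dataset, int)])

color_acc = dataset_classification_accuracy(mean_color(images), labels, rng)
gray_acc = dataset_classification_accuracy(to_grayscale(images), labels, rng)

print(f"accuracy from color statistics only: {color_acc:.2f}")
print(f"accuracy from grayscale pixels only: {gray_acc:.2f}")
```

In this toy setup the color view should separate the two sources almost perfectly while the grayscale view should stay near the 50% chance level, which localizes the synthetic bias to color. The paper applies the same reasoning to real datasets with much richer transformations and classifiers.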

Understanding Bias in Large-Scale Visual Datasets

Discovering Biases in Image Datasets with the Crowd

Crowdsourcing Detection of Sampling Biases in Image Datasets

Seeing the Unseen: Errors and Bias in Visual Datasets

VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model

Language-guided Detection and Mitigation of Unknown Dataset Bias

BiasDora: Exploring Hidden Biased Associations in Vision-Language Models

Understanding Bias in Machine Learning

Uncurated Image-Text Datasets: Shedding Light on Demographic Bias

Bias and Generalization in Deep Generative Models: An Empirical Study

Analyzing and Mitigating Bias for Vulnerable Classes: Towards Balanced Representation in Dataset

GradBias: Unveiling Word Influence on Bias in Text-to-Image Generative Models

REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets

MultiModal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision Language Models

Visual Data Diagnosis and Debiasing with Concept Graphs

ViG-Bias: Visually Grounded Bias Discovery and Mitigation

Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI

BIGbench: A Unified Benchmark for Social Bias in Text-to-Image Generative Models Based on Multi-modal LLM

Fairness and Bias Mitigation in Computer Vision: A Survey

Discovering and Mitigating Visual Biases through Keyword Explanation

Unsupervised Learning of Unbiased Visual Representations