Abstract: Ensuring robust performance on long-tail examples is an important problem for many real-world applications of machine learning, such as autonomous driving. This work focuses on the problem of identifying rare examples within a corpus of unlabeled data. We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM). Our approach utilizes a VLM to summarize the content of an image into a set of keywords, and we identify rare examples based on keyword frequency. We find that the VLM offers a distinct signal for identifying long-tail examples when compared to conventional methods based on model uncertainty. Therefore, we propose a simple and general approach for integrating signals from multiple mining algorithms. We evaluate the proposed method on two diverse tasks: 2D image classification, in which inter-class variation is the primary source of data diversity, and on 3D object detection, where intra-class variation is the main concern. Furthermore, through the detection task, we demonstrate that the knowledge extracted from 2D images is transferable to the 3D domain. Our experiments consistently show large improvements (between 10% and 50%) over the baseline techniques on several representative benchmarks: ImageNet-LT, Places-LT, and the Waymo Open Dataset.
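
The keyword-frequency idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, so the rule used here (mean negative log of each keyword's relative frequency) is an assumption, as are the function names `novelty_scores` and `mine_rare` and the toy keyword lists, which stand in for the output of a VLM.

```python
import math
from collections import Counter


def novelty_scores(keywords_per_image):
    """Score each image by how rare its keywords are (higher = rarer)."""
    # Count in how many images each keyword appears.
    doc_freq = Counter()
    for keywords in keywords_per_image:
        doc_freq.update(set(keywords))
    n_images = len(keywords_per_image)

    scores = []
    for keywords in keywords_per_image:
        unique = set(keywords)
        if not unique:
            scores.append(0.0)
            continue
        # Assumed rule (not from the paper): average the negative log of
        # each keyword's relative frequency across the corpus.
        rarity = [-math.log(doc_freq[k] / n_images) for k in unique]
        scores.append(sum(rarity) / len(rarity))
    return scores


def mine_rare(keywords_per_image, budget):
    """Return the indices of the `budget` images with the highest scores."""
    scores = novelty_scores(keywords_per_image)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Hypothetical keyword sets, as a VLM might produce for four images.
corpus = [
    ["car", "road", "tree"],
    ["car", "road", "pedestrian"],
    ["car", "road"],
    ["excavator", "road", "cone"],
]
print(mine_rare(corpus, budget=1))
```

In this toy corpus the image tagged with "excavator" and "cone" ranks first, because both keywords occur only once while "car" and "road" appear almost everywhere.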

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses the problem of identifying rare samples in long-tail data. Specifically, the authors focus on mining rare samples from a large amount of unlabeled data to improve the performance of machine learning models in long-tail scenarios. Long-tail scenarios are crucial in many real-world applications, such as autonomous driving.

### Background and Motivation

In many real-world robotic applications, such as autonomous driving, models need to handle a wide variety of situations. Ensuring model performance in long-tail scenarios remains a challenging problem. Existing long-tail learning methods mainly focus on improving model optimization, modifying model structures, or fine-tuning from pre-trained models, but these methods primarily improve models given a fixed dataset. In contrast, this paper focuses on mining long-tail samples from a large amount of unlabeled data to improve model performance.

### Method Overview

The authors propose a simple and scalable data mining method called **VLMine**, which leverages the knowledge of a large vision-language model (VLM) to identify rare samples. The specific steps are as follows:

1. **Keyword Extraction**: Use a VLM to summarize the image content into a set of keywords.
2. **Sample Selection**: Select rare samples based on keyword frequency. The keyword frequencies of each sample are converted into a novelty score, and samples with higher novelty scores are selected.

Additionally, the authors propose a **Pareto Mining** algorithm that combines signals from different mining algorithms to further enhance mining effectiveness.

### Experimental Results

The authors evaluated the proposed method on two tasks: 2D image classification and 3D object detection.

- **2D Image Classification**: On the ImageNet-LT and Places-LT datasets, VLMine significantly outperformed baseline methods in identifying long-tail samples. In particular, VLMine excelled in the classification accuracy of tail classes.
- **3D Object Detection**: On the Waymo Open Dataset, VLMine also performed well in identifying long-tail vehicles and pedestrians, especially larger vehicles and smaller pedestrians.

### Main Contributions

1. Proposed VLMine, a long-tail data mining algorithm leveraging VLM knowledge.
2. Proposed the Pareto Mining algorithm to combine signals from multiple mining techniques.
3. Demonstrated the effectiveness of the proposed methods on multiple benchmark datasets, particularly in identifying long-tail samples.

### Conclusion

By leveraging the knowledge of a VLM, this paper proposes an effective long-tail data mining method that can significantly improve model performance across multiple tasks. For real-world applications such as autonomous driving in particular, the method can identify and utilize rare samples, thereby enhancing system robustness and performance.
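
The summary does not describe how Pareto Mining combines its input signals. As a rough illustration of what a Pareto-based combination could look like, the sketch below peels successive Pareto fronts (non-dominated sets) from per-sample score vectors until a selection budget is filled. This is an assumption based on the algorithm's name and may differ from the authors' procedure; `dominates`, `pareto_mine`, the sum-based tie-break, and the example scores are all hypothetical.

```python
def dominates(a, b):
    """True if vector a is >= b in every score and > b in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_mine(score_vectors, budget):
    """Select up to `budget` sample indices by peeling Pareto fronts.

    Each score vector holds one score per mining algorithm, all oriented
    so that a higher value means a rarer sample.
    """
    remaining = set(range(len(score_vectors)))
    selected = []
    while remaining and len(selected) < budget:
        # Current front: samples not dominated by any other remaining sample.
        front = [
            i for i in remaining
            if not any(
                dominates(score_vectors[j], score_vectors[i])
                for j in remaining if j != i
            )
        ]
        # Assumed tie-break within a front: larger score sum first.
        front.sort(key=lambda i: (-sum(score_vectors[i]), i))
        selected.extend(front[: budget - len(selected)])
        remaining -= set(front)
    return selected


# Hypothetical (keyword novelty, model uncertainty) pairs for five samples.
scores = [
    (0.9, 0.1),  # rare keywords, confident model
    (0.2, 0.7),  # common keywords, uncertain model
    (0.6, 0.6),  # moderately high on both signals
    (0.1, 0.1),  # low on both signals
    (0.4, 0.4),  # dominated by the sample above it at (0.6, 0.6)
]
print(pareto_mine(scores, budget=3))
```

With these scores, the first front contains the first three samples: each is strongest on at least one signal, or on a trade-off between the two, so none is dominated. A top-k ranking on either score alone would miss one of them.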

VLMine: Long-Tail Data Mining with Vision Language Models
