Abstract: Detection of out-of-distribution (OOD) samples is crucial for safe real-world deployment of machine learning models. Recent advances in vision-language foundation models have made them capable of detecting OOD samples without requiring in-distribution (ID) images. However, these zero-shot methods often underperform as they do not adequately consider ID class likelihoods in their detection confidence scoring. Hence, we introduce CLIPScope, a zero-shot OOD detection approach that normalizes the confidence score of a sample by class likelihoods, akin to a Bayesian posterior update. Furthermore, CLIPScope incorporates a novel strategy to mine OOD classes from a large lexical database. It selects class labels that are farthest and nearest to ID classes in terms of CLIP embedding distance to maximize coverage of OOD samples. We conduct extensive ablation studies and empirical evaluations, demonstrating state-of-the-art performance of CLIPScope across various OOD detection benchmarks.
What problem does this paper attempt to address?
The paper addresses a critical issue in the real-world deployment of machine learning models: how to reliably detect out-of-distribution (OOD) samples. It proposes CLIPScope, a zero-shot OOD detection method that improves detection by normalizing each sample's confidence score by class likelihoods, in the manner of a Bayesian posterior update.
### Overview of the Problem Addressed by the Paper
- **Background and Challenges**: Machine learning systems typically assume that test data follows the same distribution as the training data. In practice, models may encounter OOD data that was not present in the training set. Vision-language foundation models can detect such samples in a zero-shot manner, without requiring in-distribution (ID) images, but existing zero-shot methods often underperform because their confidence scores do not adequately account for ID class likelihoods.
- **Proposed Method**: CLIPScope builds on CLIP (Contrastive Language-Image Pre-training), a vision-language foundation model. It normalizes the confidence score of a sample by class likelihoods, akin to a Bayesian posterior update, so that OOD samples assigned to high-frequency categories receive lower scores.
- **Innovations**:
  - Introduces Bayesian inference to dynamically adjust the confidence scores of OOD samples, thereby improving detection accuracy.
  - Proposes a new strategy to mine potential OOD labels from large lexical databases (such as WordNet), considering both the nearest and farthest words from known categories to maximize the coverage of the OOD sample space.
  - The method does not rely on additional training data or complex preprocessing steps but fully utilizes existing resources (such as WordNet), making CLIPScope more efficient and easier to implement.
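
The two ideas above, mining the nearest and farthest candidate labels and normalizing the confidence score by a class likelihood, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the distance definition (one minus the maximum cosine similarity to any ID class), the temperature `tau`, and the `class_likelihood` input are all assumptions made for illustration, and plain Python lists stand in for CLIP text and image embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def mine_ood_labels(id_embs, cand_embs, k):
    """Indices of the k nearest and k farthest candidate labels (k >= 1).

    Assumption: a candidate's distance to the ID set is
    1 - (max cosine similarity to any ID class embedding).
    """
    dist = [1.0 - max(cosine(c, e) for e in id_embs) for c in cand_embs]
    order = sorted(range(len(cand_embs)), key=lambda i: dist[i])
    # Union of both ends of the ranking; a set removes overlap on small pools.
    return sorted(set(order[:k]) | set(order[-k:]))


def normalized_id_score(img_emb, id_embs, ood_embs, class_likelihood, tau=0.01):
    """Softmax mass on ID labels divided by the likelihood of the top ID class.

    Assumption: tau and class_likelihood are hypothetical inputs, and the
    division is only one plausible reading of "normalizes by class likelihoods".
    """
    logits_id = [cosine(img_emb, e) / tau for e in id_embs]
    logits_ood = [cosine(img_emb, e) / tau for e in ood_embs]
    top = max(logits_id + logits_ood)  # subtract the max for numerical stability
    exp_id = [math.exp(z - top) for z in logits_id]
    exp_ood = [math.exp(z - top) for z in logits_ood]
    raw = sum(exp_id) / (sum(exp_id) + sum(exp_ood))
    best = max(range(len(exp_id)), key=lambda i: exp_id[i])
    return raw / class_likelihood[best]
```

In this sketch a higher score means "more likely in-distribution". Because of the division, the value is a ranking score rather than a probability and can exceed 1. An image whose top ID class has a high assumed likelihood is scored lower than the same image under a uniform likelihood.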
### Overview of Experimental Results
- **Experimental Setup**: The paper uses ImageNet-1K as the benchmark in-distribution dataset and multiple other datasets (such as iNaturalist, SUN, Places, and Textures) as OOD datasets for evaluation.
- **Evaluation Metrics**: The performance is measured using two standard metrics: AUROC (Area Under the Receiver Operating Characteristic Curve) and FPR95 (False Positive Rate at 95% True Positive Rate).
- **Comparison Methods**: The paper compares CLIPScope with other zero-shot OOD detection methods (such as Mahalanobis distance, energy score, ZOC, MCM, CLIPN, and NegLabel) as well as with OOD detection methods that require training (such as MSP, ODIN, and GradNorm).
- **Performance**: According to the results shown in Table 1, CLIPScope achieves significantly better performance than other methods on all tested OOD datasets in terms of both AUROC and FPR95, indicating high accuracy and reliability in OOD detection.
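
For readers unfamiliar with the two metrics named above, the following sketch computes them from lists of confidence scores. It assumes that a higher score means "more in-distribution" and that ID samples are the positive class. The function names and the quadratic-time AUROC are illustrative choices, not taken from the paper.

```python
import math


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample.

    Ties count as half a win; this pairwise form equals the area under
    the ROC curve.
    """
    wins = 0.0
    for s in id_scores:
        for t in ood_scores:
            if s > t:
                wins += 1.0
            elif s == t:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted when at least `tpr` of ID samples are."""
    ranked = sorted(id_scores, reverse=True)
    n_keep = math.ceil(tpr * len(ranked))  # ID samples that must be accepted
    threshold = ranked[n_keep - 1]         # lowest score still accepted
    false_pos = sum(1 for t in ood_scores if t >= threshold)
    return false_pos / len(ood_scores)
```

A perfect detector has an AUROC of 1.0 and an FPR95 of 0.0, so higher AUROC and lower FPR95 are better.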