Abstract: We tackle the challenge of predicting models' Out-of-Distribution (OOD) performance using in-distribution (ID) measurements without requiring OOD data. Existing evaluations with "Effective Robustness", which use ID accuracy as an indicator of OOD accuracy, encounter limitations when models are trained with diverse supervision and distributions, such as class labels (Vision Models, VMs, on ImageNet) and textual descriptions (Visual-Language Models, VLMs, on LAION). VLMs often generalize better to OOD data than VMs despite having similar or lower ID performance. To improve the prediction of models' OOD performance from ID measurements, we introduce the Lowest Common Ancestor (LCA)-on-the-Line framework. This approach revisits the established concept of LCA distance, which measures the hierarchical distance between labels and predictions within a predefined class hierarchy, such as WordNet. We assess 75 models using ImageNet as the ID dataset and five significantly shifted OOD variants, uncovering a strong linear correlation between ID LCA distance and OOD top-1 accuracy. Our method provides a compelling alternative for understanding why VLMs tend to generalize better. Additionally, we propose a technique to construct a taxonomic hierarchy on any dataset using K-means clustering, demonstrating that LCA distance is robust to the constructed taxonomic hierarchy. Moreover, we demonstrate that aligning model predictions with class taxonomies, through soft labels or prompt engineering, can enhance model generalization. Open-source code is available on our project page: https://elvishelvis.github.io/papers/lca/
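
To make the idea of LCA distance concrete, here is a minimal sketch on a toy hierarchy. It is not the authors' implementation: the class names, the `PARENT` map, and the helper functions are all hypothetical, and the distance used (number of edges climbed from the true label to the lowest common ancestor of label and prediction) is just one simple variant chosen for illustration.

```python
# Toy class hierarchy stored as child -> parent.
# Assumption: these names are illustrative only, not taken from WordNet.
PARENT = {
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "truck": "vehicle",
}


def ancestors(node):
    """Return the path from `node` up to the root, starting with `node`."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def lca(a, b):
    """Lowest common ancestor: first node on a's upward path that lies on b's path."""
    above_b = set(ancestors(b))
    for node in ancestors(a):
        if node in above_b:
            return node
    raise ValueError("nodes do not share a root")


def lca_distance(label, pred):
    """Edges climbed from the label to reach LCA(label, pred); 0 for a correct prediction."""
    return ancestors(label).index(lca(label, pred))


def mean_lca_distance(labels, preds):
    """Average LCA distance over a set of (label, prediction) pairs."""
    return sum(lca_distance(y, p) for y, p in zip(labels, preds)) / len(labels)
```

Under this variant, mistaking a dog for a cat costs less than mistaking a dog for a car, which is the kind of mistake severity that plain top-1 accuracy ignores.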

What problem does this paper attempt to address?

This paper addresses the problem of predicting a model's performance on out-of-distribution (OOD) data. Specifically, it predicts OOD performance from in-distribution (ID) measurements, without access to the actual OOD data. Existing evaluation methods such as "Effective Robustness" use ID accuracy as an indicator of OOD accuracy, but they encounter limitations when models are trained with diverse supervision and distributions, such as vision models (VMs) trained on ImageNet and visual-language models (VLMs) trained on LAION.

#### Specific problems include:

1. **Comparison of different model types**: VLMs tend to perform better on OOD data than VMs, even though their ID performance is similar or lower. Explaining this gap requires a unified evaluation framework that can compare generalization ability across model families.
2. **Lack of a unified metric**: Existing evaluation methods mainly focus on VMs and lack effective means of evaluating VLMs, especially when the models are trained on different data sources.
3. **Robustness to large-scale domain shift**: A robustness metric is needed that can cope with large-scale domain shift, such as severe visual changes in the image domain.
4. **Computational efficiency**: A computationally efficient metric is needed to avoid the intensive computation that existing methods involve, especially when dealing with multiple models or inference passes.

To solve these problems, the authors propose the "Lowest Common Ancestor (LCA)-on-the-Line" framework. This framework uses the lowest common ancestor distance (LCA distance) in a class hierarchy (such as WordNet) to measure the generalization ability of models. Through a series of experiments, the authors find a strong linear correlation between ID LCA distance and OOD top-1 accuracy on multiple ImageNet-OOD datasets, which provides a new, unified metric for evaluating the generalization ability of models.

#### Summary of main contributions:

1. **Propose LCA distance as a new metric**: Use a class hierarchy (such as WordNet) to encode the relationships between classes and evaluate the generalization ability of models.
2. **Verify the benchmark strategy**: Analyze the performance of 75 models on five ImageNet-OOD datasets through large-scale experiments, and reveal the strong linear correlation between ID LCA distance and OOD top-1 accuracy.
3. **Analyze the connection between LCA and model generalization in depth**: Provide new insights and inspire further research.
4. **Introduce a method for constructing class hierarchies**: For datasets without a predefined hierarchy, use K-means clustering to construct one, and show that LCA distance is robust to the constructed hierarchy.
5. **Demonstrate the potential for improving model generalization**: Show that aligning model predictions with the class hierarchy can improve generalization.

Through these contributions, the paper offers a new perspective on how models perform on OOD data, along with new ideas and tools for improving their generalization ability.
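
The "on-the-line" claim amounts to checking how well a straight line relates a per-model ID measurement to OOD accuracy. The sketch below shows one way such a check could be run; it is an illustration, not the authors' evaluation code. The function names are hypothetical, and the four data points are made-up placeholders chosen only to exercise the code, with no claim about the values or the trend reported in the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    slope = cov / vx
    return slope, my - slope * mx


# Placeholder values (NOT results from the paper): one entry per hypothetical model.
id_lca_distance = [1.0, 2.0, 3.0, 4.0]   # mean LCA distance measured on ID data
ood_top1_acc = [0.70, 0.60, 0.50, 0.40]  # top-1 accuracy on an OOD dataset

r = pearson_r(id_lca_distance, ood_top1_acc)
slope, intercept = fit_line(id_lca_distance, ood_top1_acc)

# With a fitted line, OOD accuracy for a new model can be estimated from ID data alone.
predicted_ood = slope * 2.5 + intercept
```

In practice each point would come from evaluating one trained model, and the strength of `r` across many models is what indicates whether the ID measurement is a useful predictor.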

LCA-on-the-Line: Benchmarking Out-of-Distribution Generalization with Class Taxonomies

Logit Scaling for Out-of-Distribution Detection

How Good Are LLMs at Out-of-Distribution Detection?

Predicting Out-of-Domain Generalization with Neighborhood Invariance

LBC: Language-Based-Classifier for Out-Of-Variable Generalization

Scaling for Training Time and Post-hoc Out-of-distribution Detection Enhancement

Distilling Large Vision-Language Model with Out-of-Distribution Generalizability

Envisioning Outlier Exposure by Large Language Models for Out-of-Distribution Detection

Dissecting the Failure of Invariant Learning on Graphs

Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations

OoD-Bench: Quantifying and Understanding Two Dimensions of Out-of-Distribution Generalization

From Global to Local: Multi-scale Out-of-distribution Detection

Out-of-Distribution Detection in Long-Tailed Recognition with Calibrated Outlier Class Learning

Exploring Large Language Models for Multi-Modal Out-of-Distribution Detection

In Search of Forgotten Domain Generalization

OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning

Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift

OAL: Enhancing OOD Detection Using Latent Diffusion

Category-Extensible Out-of-Distribution Detection via Hierarchical Context Descriptions

Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks

LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models