Abstract: Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions. Meanwhile, MixEval-X's model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98) while being much more efficient. We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper aims to address two major issues in the current evaluation of multimodal AI models:

1. **Inconsistent Evaluation Standards**
   - Different communities (such as the large language model community and the audio language model community) follow different evaluation protocols and are at different levels of maturity, so evaluation standards differ significantly between them. For example, the large language model community has hundreds of multi-task evaluations covering various fields and methods, while the audio language model community still relies on task-specific benchmarks.
2. **Query, Scoring, and Generalization Bias**
   - **Query Bias**: Evaluation tasks deviate from real-world task distributions, resulting in discrepancies between evaluation results and actual performance.
   - **Scoring Bias**: Unfair scoring mechanisms lead to distorted evaluation results.
   - **Generalization Bias**: Contamination of evaluation datasets causes models to overfit, affecting the validity of the evaluation.

To address these issues, the paper proposes **MixEval-X**, the first any-to-any, real-world benchmark covering diverse input and output modalities, which aims to optimize and standardize multimodal evaluation. MixEval-X reconstructs real-world task distributions through multimodal benchmark mixture and adaptation-rectification pipelines, so that evaluation results generalize to practical use cases. The paper also conducts extensive meta-evaluations, which show that the method aligns benchmark samples with real-world task distributions and that its model rankings correlate strongly with crowd-sourced real-world evaluations (up to 0.98) while being more efficient.

### Main Contributions

1. **Proposing Multimodal Benchmark Mixture and Adaptation-Rectification Pipelines**: Provides an efficient method to create low-bias any-to-any benchmarks with real-world distributions.
2. **Introducing MixEval-X**: The first high-standard, unified real-world benchmark covering various input-output modalities, reducing bias and heterogeneity in AI evaluation.
3. **Providing Comprehensive Evaluation Results**: Reranks models and organizations across multiple communities.
4. **Conducting Extensive Meta-Evaluations**: Offers insights to guide AI evaluation and future research.

Through these contributions, MixEval-X aims to optimize and standardize evaluations across AI communities, ensuring that the evaluation of unimodal models keeps up with the latest standards and that the evaluation of multimodal models maintains a consistently high standard across modalities, preventing any single modality from becoming a bottleneck.
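
The ranking-correlation claim above (up to 0.98) can be made concrete with a small sketch. The summary does not say which correlation coefficient the paper reports, so Spearman rank correlation is an assumption here, and the model names and scores are invented purely for illustration. The example compares a benchmark's model ordering with the ordering from a hypothetical crowd-sourced evaluation.

```python
from statistics import mean


def ranks(scores):
    """Return 1-based ranks (highest score gets rank 1), averaging ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j over the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: benchmark accuracy and crowd-sourced ratings per model.
benchmark_scores = {"model_a": 71.2, "model_b": 64.5, "model_c": 58.0, "model_d": 49.3}
arena_ratings = {"model_a": 1210, "model_b": 1150, "model_c": 1165, "model_d": 1020}

models = sorted(benchmark_scores)
rho = spearman(
    [benchmark_scores[m] for m in models],
    [arena_ratings[m] for m in models],
)
print(f"Spearman correlation between the two rankings: {rho:.2f}")
```

A value near 1 means the benchmark orders models almost exactly as the crowd-sourced evaluation does; in this toy data two adjacent models swap places, which lowers the coefficient.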

MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

OpenMixup: A Comprehensive Mixup Benchmark for Visual Classification

Fluorescence studies of multiphoton ionization of Sr: Production of excited ionic states.

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

A Survey on Mixup Augmentations and Beyond

LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content

OpenMixup: Open Mixup Toolbox and Benchmark for Visual Representation Learning

ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

SMMix: Self-Motivated Image Mixing for Vision Transformers

Human-in-the-Loop Mixup

Mix-ME: Quality-Diversity for Multi-Agent Learning

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

Evaluating General-Purpose AI with Psychometrics