MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Shieh
2024-10-18
Abstract: Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions. Meanwhile, MixEval-X's model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98) while being much more efficient. We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.
Artificial Intelligence, Machine Learning, Multimedia
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses two major issues in the current evaluation of multimodal AI models:

1. **Inconsistent Evaluation Standards**: Different communities (such as the large language model community and the audio language model community) use different evaluation protocols at different levels of maturity, so evaluation standards diverge significantly. For example, the large language model community has hundreds of multi-task evaluations covering various fields and methods, while the audio language model community still relies on task-specific benchmarks.
2. **Query, Grading, and Generalization Bias**:
   - **Query Bias**: Evaluation tasks deviate from real-world task distributions, so evaluation results diverge from actual performance.
   - **Grading Bias**: Unfair grading mechanisms distort evaluation results.
   - **Generalization Bias**: Contamination of evaluation datasets causes models to overfit, undermining the validity of the evaluation.

To address these issues, the paper proposes **MixEval-X**, the first any-to-any, real-world benchmark, which aims to optimize and standardize evaluation across input and output modalities. MixEval-X reconstructs real-world task distributions through multimodal benchmark mixture and adaptation-rectification pipelines, so that evaluations generalize to practical use cases. The paper also conducts extensive meta-evaluations showing that the method aligns benchmark samples with real-world task distributions and that its model rankings correlate strongly with crowd-sourced real-world evaluations (up to 0.98) while being much more efficient.

### Main Contributions

1. **Proposing Multimodal Benchmark Mixture and Adaptation-Rectification Pipelines**: Provides an efficient method to create low-bias any-to-any benchmarks with real-world distributions.
2. **Introducing MixEval-X**: The first high-standard, unified real-world benchmark covering various input-output modalities, reducing bias and heterogeneity in AI evaluation.
3. **Providing Comprehensive Evaluation Results**: Reranks models and organizations across multiple communities.
4. **Conducting Extensive Meta-Evaluations**: Offers insights to guide AI evaluation and future research.

Through these contributions, MixEval-X aims to optimize and standardize evaluations across AI communities, so that the evaluation of unimodal models keeps up with the latest standards and the evaluation of multimodal models maintains a consistently high standard across modalities, preventing any single modality from becoming a bottleneck.
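
The ranking correlation cited in the summary (up to 0.98 against crowd-sourced real-world evaluations) can be illustrated with a small sketch. The summary does not say which coefficient the authors use, so Spearman's rank correlation is an assumption here, and the model names and scores are made-up placeholders rather than results from the paper.

```python
import math
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks (1 = highest value); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(benchmark: Dict[str, float], reference: Dict[str, float]) -> float:
    """Spearman correlation over the models present in both score tables."""
    models = sorted(set(benchmark) & set(reference))
    if len(models) < 2:
        raise ValueError("need at least two shared models")
    bench_ranks = average_ranks([benchmark[m] for m in models])
    ref_ranks = average_ranks([reference[m] for m in models])
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(bench_ranks, ref_ranks)


# Hypothetical scores, for illustration only (not taken from the paper).
benchmark_scores = {"model_a": 0.82, "model_b": 0.75, "model_c": 0.61,
                    "model_d": 0.58, "model_e": 0.40}
reference_scores = {"model_a": 1250, "model_b": 1190, "model_c": 1100,
                    "model_d": 1120, "model_e": 1010}

rho = spearman(benchmark_scores, reference_scores)
print(f"Spearman correlation: {rho:.2f}")
```

In this toy data the two rankings disagree only on the order of `model_c` and `model_d`, so the coefficient stays high. A value near 1 means the benchmark orders models almost exactly as the reference evaluation does, which is the sense in which a cheaper benchmark can stand in for a crowd-sourced one.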