Abstract: Large Language Models (LLMs) have demonstrated potential as effective search relevance evaluators. However, there is a lack of comprehensive guidance on which models consistently perform optimally across various contexts or within specific use cases. In this paper, we assess several LLMs and Multimodal Language Models (MLLMs) in terms of their alignment with human judgments across multiple multimodal search scenarios. Our analysis investigates the trade-offs between cost and accuracy, highlighting that model performance varies significantly depending on the context. Interestingly, in smaller models, the inclusion of a visual component may hinder performance rather than enhance it. These findings highlight the complexities involved in selecting the most appropriate model for practical applications.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper evaluates how large language models (LLMs) and multimodal language models (MLLMs) perform in different multimodal search scenarios, with a focus on how consistently their relevance judgments agree with human judgment. Specifically, the paper attempts to answer the following questions:

1. **Does LLM performance depend on the usage scenario?** That is, does the same LLM perform consistently across application scenarios, or does it perform better in some scenarios and worse in others?
2. **Is there a model that is significantly superior to other models?** Is there a model that performs well in every application scenario?
3. **Is multimodal support necessary for relevance judgment in multimodal search?** Do multimodal models really improve the accuracy of relevance judgments, or do they reduce performance in some cases?
4. **Which models offer the optimal cost-accuracy trade-off?** When cost and accuracy are considered together, which models are the best choices?

Through these questions, the paper attempts to guide researchers and practitioners in selecting the most appropriate LLM or MLLM for multimodal search relevance evaluation. The results show that a model's performance depends not only on its scale and technical characteristics but also on the specific application scenario. The research also reveals that multimodal support is not always beneficial: especially for smaller models, the visual component may sometimes reduce performance.

### Formula Explanation

The paper does not involve complex formulas. Where an agreement statistic needs to be expressed, such as Cohen's kappa coefficient, it can be written as follows:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

where:

- \(p_o\) is the proportion of observed agreement.
- \(p_e\) is the proportion of agreement expected by chance.
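To make the kappa formula above concrete, here is a minimal Python sketch that computes it for two lists of relevance labels. The function name `cohens_kappa` and the sample labels are illustrative assumptions, not taken from the paper, and the sketch does not claim to reproduce the paper's evaluation procedure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # p_o: proportion of items on which the two raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement equals 1")

    return (p_o - p_e) / (1 - p_e)


# Made-up binary relevance labels (1 = relevant, 0 = not relevant).
human_labels = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
model_labels = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
print(round(cohens_kappa(human_labels, model_labels), 3))  # prints 0.6
```

In this made-up example the raters agree on 8 of 10 items (\(p_o = 0.8\)) and each uses both labels half of the time (\(p_e = 0.5\)), so \(\kappa = (0.8 - 0.5) / (1 - 0.5) = 0.6\).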

Evaluating Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

Large Language Models for Relevance Judgment in Product Search

Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning

Leveraging Large Language Models for Multimodal Search

A Survey on Benchmarks of Multimodal Large Language Models

A Survey on Evaluation of Multimodal Large Language Models

MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs)

Reasons to Reject? Aligning Language Models with Judgments

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Cost-Effective Online Multi-LLM Selection with Versatile Reward Models

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

Beyond Metrics: Evaluating LLMs' Effectiveness in Culturally Nuanced, Low-Resource Real-World Scenarios

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

Exploring Large Language Models for Relevance Judgments in Tetun

Toward Automatic Relevance Judgment using Vision-Language Models for Image-Text Retrieval Evaluation