Evaluating Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements

Silvia Terragni, Hoang Cuong, Joachim Daiber, Pallavi Gudipati, Pablo N. Mendes
2024-10-26
Abstract: Large Language Models (LLMs) have demonstrated potential as effective search relevance evaluators. However, there is a lack of comprehensive guidance on which models consistently perform optimally across various contexts or within specific use cases. In this paper, we assess several LLMs and Multimodal Language Models (MLLMs) in terms of their alignment with human judgments across multiple multimodal search scenarios. Our analysis investigates the trade-offs between cost and accuracy, highlighting that model performance varies significantly depending on the context. Interestingly, in smaller models, the inclusion of a visual component may hinder performance rather than enhance it. These findings highlight the complexities involved in selecting the most appropriate model for practical applications.
Machine Learning, Computation and Language, Information Retrieval
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper evaluates how well large language models (LLMs) and multimodal language models (MLLMs) agree with human judgments across different multimodal search scenarios. Specifically, it attempts to answer the following questions:

1. **Does LLM performance depend on the usage scenario?**
   - Does the same LLM perform consistently across application scenarios, or does it perform better in some and worse in others?
2. **Is there a model that is significantly superior to other models?**
   - Is there a single model that performs best in every application scenario?
3. **Is multimodal support necessary for relevance judgment in multimodal search?**
   - Do multimodal models actually improve the accuracy of relevance judgments, or do they reduce performance in some cases?
4. **Which models offer the optimal cost-accuracy trade-off?**
   - When both cost and accuracy are considered, which models are the best choices?

Through these questions, the paper aims to guide researchers and practitioners in selecting the most appropriate LLM or MLLM for multimodal search relevance evaluation. The results show that a model's performance depends not only on its scale and technical characteristics but also on the specific application scenario. The research also shows that multimodal support is not always beneficial: for smaller models in particular, the visual component may sometimes reduce performance.

### Formula Explanation

The paper does not involve complex mathematical formulas, so no special formula formatting is needed.
However, if a formula needs to be expressed, such as Cohen's kappa coefficient for agreement between two raters, it can be written as follows:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

where:

- \(p_o\) is the proportion of observed agreement.
- \(p_e\) is the proportion of agreement expected by chance.
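
As a minimal sketch, the kappa formula above could be computed for two lists of relevance labels as follows. The function name `cohens_kappa`, the label values, and the example data are illustrative assumptions and are not taken from the paper. Chance agreement \(p_e\) is estimated from each rater's own label frequencies, which is the standard definition for two raters.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # p_o: fraction of items on which the two raters give the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: probability that the raters agree by chance, computed from
    # each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label for every item,
        # so the denominator is zero and kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement equals 1")

    return (p_o - p_e) / (1 - p_e)


# Hypothetical example: human judgments vs. model judgments on six results.
human = ["relevant", "relevant", "not_relevant", "not_relevant", "relevant", "not_relevant"]
model = ["relevant", "not_relevant", "not_relevant", "not_relevant", "relevant", "relevant"]

kappa = cohens_kappa(human, model)
print(round(kappa, 3))  # prints 0.333
```

Here the raters agree on 4 of 6 items (\(p_o \approx 0.667\)) and chance agreement is \(p_e = 0.5\), giving \(\kappa \approx 0.333\). Raising an error in the degenerate case where \(p_e = 1\) is one possible design choice. Other implementations may return a fixed value or NaN instead.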