JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment

Hossein A. Rahmani, Emine Yilmaz, Nick Craswell, Bhaskar Mitra
2024-12-18
Abstract: The effective training and evaluation of retrieval systems require a substantial amount of relevance judgments, which are traditionally collected from human assessors -- a process that is both costly and time-consuming. Large Language Models (LLMs) have shown promise in generating relevance labels for search tasks, offering a potential alternative to manual assessments. Current approaches often rely on a single LLM, such as GPT-4, which, despite being effective, is expensive and prone to intra-model biases that can favour systems leveraging similar models. In this work, we introduce JudgeBlender, a framework that employs smaller, open-source models to provide relevance judgments by combining evaluations across multiple LLMs (LLMBlender) or multiple prompts (PromptBlender). By leveraging the LLMJudge benchmark [18], we compare JudgeBlender with state-of-the-art methods and the top performers in the LLMJudge challenge. Our results show that JudgeBlender achieves competitive performance, demonstrating that very large models are often unnecessary for reliable relevance assessments.
Information Retrieval
What problem does this paper attempt to address?
This paper addresses the problem of automatic relevance assessment in information retrieval systems. Specifically:

1. **Costly and time-consuming manual relevance labels**: Traditional relevance assessment relies on human annotation, which is both expensive and slow.
2. **Limitations of a single large language model (LLM)**: Although a single LLM such as GPT-4 is effective at generating relevance labels, it is expensive and prone to intra-model bias, which can favour systems that leverage similar models.

To address these problems, the authors propose a framework named **JudgeBlender**, which aims at more robust and accurate relevance assessment by combining multiple smaller open-source models. It has two variants:

- **PromptBlender**: Uses a single language model with multiple different prompting strategies to assess the relevance of a document to a query.
- **LLMBlender**: Uses multiple different language models, each of which assesses the relevance of a document to a query according to its own prompt.

By aggregating the outputs of multiple models or prompting strategies, JudgeBlender aims to reduce the inherent bias of a single model and improve the overall accuracy and consistency of relevance assessment. Experimental results on the LLMJudge benchmark show that JudgeBlender is competitive with state-of-the-art methods and the top performers in the LLMJudge challenge, which suggests that very large models are often unnecessary for reliable relevance assessment.

### Formula Representation

The relevance score aggregation used in the paper can be written as:

\[ \text{Final Relevance Score} = f(j \in P : j(a)) \]

where:

- \( P \) is a panel of assessors (i.e., models or prompts),
- \( j(a) \) is the judgment that assessor \( j \) assigns to the item \( a \) being assessed,
- \( f \) is an aggregation function, such as averaging or weighted voting.

### Summary

The main contributions of this paper are:

1. Proposing a new framework, JudgeBlender, that generates more reliable relevance assessments by combining multiple models or prompting strategies.
2. Designing and implementing multiple aggregation functions to compute the final relevance score.
3. Showing through experiments that the ensemble approach achieves competitive performance in relevance assessment.

Together, these contributions aim to make automatic relevance assessment more efficient and accurate while reducing dependence on a single large model.
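
The aggregation \( f(j \in P : j(a)) \) described above could be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names (`average_vote`, `majority_vote`, `weighted_vote`, `blend`), the integer graded relevance labels, the round-half-up rule, and the tie-breaking rules are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean


def average_vote(labels):
    """Average the panel's labels and round half up to an integer grade.

    Assumes non-negative integer labels (e.g. a graded relevance scale).
    """
    return int(mean(labels) + 0.5)


def majority_vote(labels):
    """Return the most frequent label.

    Ties are broken toward the higher grade (an assumption of this sketch).
    """
    counts = Counter(labels)
    best = max(counts.values())
    return max(label for label, count in counts.items() if count == best)


def weighted_vote(labels, weights):
    """Sum each assessor's weight per label; return the label with the largest total.

    Ties are again broken toward the higher grade (an assumption).
    """
    totals = {}  # label -> accumulated weight
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=lambda label: (totals[label], label))


def blend(panel, item, aggregate=average_vote):
    """Apply every assessor j in the panel P to item a, then aggregate with f."""
    return aggregate([judge(item) for judge in panel])


# Stub assessors standing in for three LLM judges (hypothetical; real
# assessors would prompt a model and parse its relevance label).
panel = [lambda pair: 2, lambda pair: 3, lambda pair: 3]
pair = ("example query", "example passage")

print(blend(panel, pair))                 # average of 2, 3, 3, rounded: 3
print(blend(panel, pair, majority_vote))  # most frequent label: 3
```

In this sketch, the same `blend` function covers both variants: for LLMBlender the panel would hold one assessor per model, and for PromptBlender one assessor per prompt applied to the same model.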