Abstract: The effective training and evaluation of retrieval systems require a substantial amount of relevance judgments, which are traditionally collected from human assessors -- a process that is both costly and time-consuming. Large Language Models (LLMs) have shown promise in generating relevance labels for search tasks, offering a potential alternative to manual assessments. Current approaches often rely on a single LLM, such as GPT-4, which, despite being effective, is expensive and prone to intra-model biases that can favour systems leveraging similar models. In this work, we introduce JudgeBlender, a framework that employs smaller, open-source models to provide relevance judgments by combining evaluations across multiple LLMs (LLMBlender) or multiple prompts (PromptBlender). By leveraging the LLMJudge benchmark [18], we compare JudgeBlender with state-of-the-art methods and the top performers in the LLMJudge challenge. Our results show that JudgeBlender achieves competitive performance, demonstrating that very large models are often unnecessary for reliable relevance assessments.

What problem does this paper attempt to address?

This paper addresses the problem of automatic relevance assessment in information retrieval systems. Specifically:

1. **Costly and time-consuming manual relevance labelling**: Traditional relevance assessment relies on human annotation, which is both expensive and slow.
2. **Limitations of a single large language model (LLM)**: Although a single LLM such as GPT-4 performs well at generating relevance labels, it is costly and subject to intra-model bias, which may favour systems built on similar models.

To address these problems, the authors propose a framework named **JudgeBlender**, which aims to provide more robust and accurate relevance assessments by combining multiple smaller open-source models. It has two variants:

- **PromptBlender**: Uses a single language model with several different prompting strategies to assess the relevance between queries and documents.
- **LLMBlender**: Uses several different language models, each assessing the relevance between queries and documents with its own prompt.

By aggregating the outputs of multiple models or prompting strategies, JudgeBlender aims to reduce the inherent bias of any single model and to improve the overall accuracy and consistency of relevance assessment. According to the abstract, JudgeBlender achieves competitive performance against state-of-the-art methods and the top performers in the LLMJudge challenge.

### Formula Representation

The relevance score aggregation used in the paper can be written as:

\[ \text{Final Relevance Score} = f\left(\{\, j(a) : j \in P \,\}\right) \]

where:

- \( P \) is a panel composed of various assessors (i.e., models or prompts),
- \( j(a) \) is the judgment that assessor \( j \) assigns to the item \( a \) being assessed (here, a query-document pair),
- \( f \) is an aggregation function, such as averaging or weighted voting.

### Summary

The main contributions of this paper are:

1. Proposing a new framework, JudgeBlender, for generating more reliable relevance assessments by combining multiple models or prompting strategies.
2. Designing and implementing multiple aggregation functions to compute the final relevance score.
3. Verifying through experiments that the ensemble approach achieves competitive performance in relevance assessment.

These improvements make automatic relevance assessment more efficient and accurate, and reduce the dependence on a single large model.
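The aggregation formula above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `blend`, `average_vote`, and `majority_vote`, the stub judges, and the 0 to 3 integer label scale are all assumptions made for illustration. In practice each judge would wrap a call to an LLM, either the same model with a different prompt (PromptBlender) or a different model (LLMBlender).

```python
from collections import Counter
from statistics import mean
from typing import Callable, Sequence

# A judge maps a (query, passage) pair to an integer relevance label.
# The 0..3 label scale is an assumption made for this sketch.
Judge = Callable[[str, str], int]


def average_vote(labels: Sequence[int]) -> int:
    """Mean of the panel's labels, rounded half up to an integer grade."""
    return int(mean(labels) + 0.5)


def majority_vote(labels: Sequence[int]) -> int:
    """Most frequent label; ties go to the lower grade (an arbitrary choice)."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


def blend(
    panel: Sequence[Judge],
    query: str,
    passage: str,
    aggregate: Callable[[Sequence[int]], int] = average_vote,
) -> int:
    """Compute f({ j(a) : j in P }) for one query-passage pair a."""
    labels = [judge(query, passage) for judge in panel]
    return aggregate(labels)


# Hypothetical stub judges standing in for real LLM calls.
def strict_judge(query: str, passage: str) -> int:
    return 1


def lenient_judge(query: str, passage: str) -> int:
    return 3


def keyword_judge(query: str, passage: str) -> int:
    return 3 if query.lower() in passage.lower() else 0


panel = [strict_judge, lenient_judge, keyword_judge]
```

Calling `blend(panel, "ensemble", "An ensemble of judges")` collects one label per judge and averages them; passing `aggregate=majority_vote` switches to voting instead. Keeping the aggregation function as a parameter mirrors the role of \( f \) in the formula, so other schemes such as weighted voting can be swapped in without touching the judges.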

JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment

LLMJudge: LLMs for Relevance Judgments

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text

LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking

Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval

JudgeBench: A Benchmark for Evaluating LLM-based Judges

RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

Can We Use Large Language Models to Fill Relevance Judgment Holes?

CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

Best in Tau@LLMJudge: Criteria-Based Relevance Evaluation with Llama3

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

Reasons to Reject? Aligning Language Models with Judgments

Large Language Models for Relevance Judgment in Product Search