JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking

Tong Niu, Shafiq Joty, Ye Liu, Caiming Xiong, Yingbo Zhou, Semih Yavuz
2024-11-01
Abstract: Accurate document retrieval is crucial for the success of retrieval-augmented generation (RAG) applications, including open-domain question answering and code completion. While large language models (LLMs) have been employed as dense encoders or listwise rerankers in RAG systems, they often struggle with reasoning-intensive tasks because they lack nuanced analysis when judging document relevance. To address this limitation, we introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance. Our approach consists of three key steps: (1) query analysis to identify the core problem, (2) document analysis to extract a query-aware summary, and (3) relevance judgment to provide a concise assessment of document relevance. We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods and outperforming other popular reranking approaches. In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability. Through comprehensive ablation studies, we demonstrate that JudgeRank's performance generalizes well across LLMs of various sizes while ensembling them yields even more accurate reranking than individual models.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to improve the accuracy of document reranking in retrieval systems, especially for tasks that require complex reasoning. Existing large language models (LLMs), when used as dense encoders or listwise rerankers, perform well on many tasks but poorly on those where judging document relevance requires detailed analysis, because they do not analyze relevance in a nuanced way.

To address this limitation, the paper proposes **JudgeRank**, a novel agentic reranker that aims to mimic the cognitive process humans follow when evaluating document relevance. JudgeRank consists of three key steps:

1. **Query Analysis**: Identify the core problem behind the query.
2. **Document Analysis**: Extract a query-aware summary of the document.
3. **Relevance Judgment**: Provide a concise assessment of the document's relevance.

Through these steps, JudgeRank can go beyond superficial lexical matching and draw on deeper semantic understanding to improve reranking accuracy.

The paper evaluates JudgeRank on BRIGHT, a reasoning-intensive retrieval benchmark. The results show substantial improvements over first-stage retrieval methods, and JudgeRank outperforms other popular reranking approaches on that benchmark. In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the BEIR benchmark, which supports its zero-shot generalization ability.
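
The three steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, the prompt wording, the binary Yes/No verdict, and the rule that ties keep their first-stage order are all hypothetical choices made for the example.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to the model's text.
LLM = Callable[[str], str]


def analyze_query(llm: LLM, query: str) -> str:
    """Step 1: ask the model to state the core problem behind the query."""
    return llm(f"Identify the core problem in this query.\n\nQuery: {query}")


def analyze_document(llm: LLM, query_analysis: str, document: str) -> str:
    """Step 2: ask for a summary of the document that is focused on the query."""
    return llm(
        "Summarize the parts of the document that bear on the problem.\n\n"
        f"Problem: {query_analysis}\n\nDocument: {document}"
    )


def judge_relevance(llm: LLM, query_analysis: str, doc_analysis: str) -> bool:
    """Step 3: ask for a concise verdict (assumed here to be Yes or No)."""
    answer = llm(
        "Is the document relevant to the problem? Answer Yes or No.\n\n"
        f"Problem: {query_analysis}\n\nDocument summary: {doc_analysis}"
    )
    return answer.strip().lower().startswith("yes")


def rerank(llm: LLM, query: str, documents: List[str]) -> List[str]:
    """Move documents judged relevant ahead of the rest.

    `documents` is assumed to arrive in first-stage retrieval order.
    """
    query_analysis = analyze_query(llm, query)  # computed once per query
    verdicts = [
        judge_relevance(
            llm, query_analysis, analyze_document(llm, query_analysis, doc)
        )
        for doc in documents
    ]
    # sorted() is stable, so documents with the same verdict keep their
    # first-stage order; False (relevant) sorts before True (not relevant).
    order = sorted(range(len(documents)), key=lambda i: not verdicts[i])
    return [documents[i] for i in order]
```

Calling `rerank(my_llm, query, first_stage_docs)` with any prompt-to-text function returns the same documents with those judged relevant first. The query analysis runs once and is reused for every document, so the cost is one call for the query plus two calls per candidate.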