JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking

Tong Niu, Shafiq Joty, Ye Liu, Caiming Xiong, Yingbo Zhou, Semih Yavuz

2024-11-01

Abstract: Accurate document retrieval is crucial for the success of retrieval-augmented generation (RAG) applications, including open-domain question answering and code completion. While large language models (LLMs) have been employed as dense encoders or listwise rerankers in RAG systems, they often struggle with reasoning-intensive tasks because they lack nuanced analysis when judging document relevance. To address this limitation, we introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance. Our approach consists of three key steps: (1) query analysis to identify the core problem, (2) document analysis to extract a query-aware summary, and (3) relevance judgment to provide a concise assessment of document relevance. We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods and outperforming other popular reranking approaches. In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability. Through comprehensive ablation studies, we demonstrate that JudgeRank's performance generalizes well across LLMs of various sizes while ensembling them yields even more accurate reranking than individual models.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses how to improve the accuracy of document reranking in information retrieval systems, especially for tasks that require complex reasoning. Existing large language models (LLMs), when used as dense encoders or listwise rerankers, perform well on many tasks but poorly on reasoning-intensive ones, because they lack nuanced analysis when judging document relevance.

To address this limitation, the paper proposes **JudgeRank**, a novel agentic reranking method that aims to mimic the cognitive process humans follow when evaluating document relevance. JudgeRank consists of three key steps:

1. **Query Analysis**: Identify the core problem of the query.
2. **Document Analysis**: Extract a query-aware summary of the document.
3. **Relevance Judgment**: Provide a concise assessment of document relevance.

Through these steps, JudgeRank can go beyond superficial lexical matching and use deeper semantic understanding to improve reranking accuracy. The paper evaluates JudgeRank on BRIGHT, a reasoning-intensive retrieval benchmark. The results show substantial improvements over first-stage retrieval methods, and JudgeRank outperforms other popular reranking approaches. On the BEIR benchmark, JudgeRank performs on par with fine-tuned state-of-the-art rerankers, which validates its zero-shot generalization ability.
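The three steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the `LLM` callable type, the function names, and the binary Yes/No verdict are all assumptions made for illustration, and the summary above does not specify how the paper turns the judgment into a final ranking score.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


def analyze_query(llm: LLM, query: str) -> str:
    """Step 1 (query analysis): identify the core problem of the query."""
    return llm(
        "Step 1 (query analysis): State the core problem this query asks about.\n\n"
        f"Query: {query}"
    )


def analyze_document(llm: LLM, query_analysis: str, document: str) -> str:
    """Step 2 (document analysis): extract a query-aware summary."""
    return llm(
        "Step 2 (document analysis): Summarize the parts of the text "
        "that bear on the core problem.\n\n"
        f"Core problem: {query_analysis}\n\n"
        f"Document: {document}"
    )


def judge(llm: LLM, query_analysis: str, summary: str) -> bool:
    """Step 3 (relevance judgment): a concise Yes/No verdict (assumed format)."""
    verdict = llm(
        "Step 3 (relevance judgment): Is the text relevant to the core problem? "
        "Answer Yes or No.\n\n"
        f"Core problem: {query_analysis}\n\n"
        f"Summary: {summary}"
    )
    return verdict.strip().lower().startswith("yes")


def rerank(llm: LLM, query: str, candidates: List[str]) -> List[str]:
    """Move documents judged relevant ahead of the rest.

    The sort is stable, so the first-stage retrieval order is preserved
    within the relevant group and within the non-relevant group.
    """
    query_analysis = analyze_query(llm, query)  # computed once per query
    judged = []
    for doc in candidates:
        summary = analyze_document(llm, query_analysis, doc)
        judged.append((judge(llm, query_analysis, summary), doc))
    return [doc for _, doc in sorted(judged, key=lambda pair: not pair[0])]
```

To use the sketch, pass any function that wraps a model call as `llm`, together with the query and the candidate list from a first-stage retriever. Running the query analysis once per query, rather than once per document, keeps the number of model calls at one plus two per candidate.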
