BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

Hongjin Su, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Haisu Liu, Quan Shi, Zachary S. Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan O. Arik, Danqi Chen, Tao Yu

2024-10-24

Abstract: Existing retrieval benchmarks primarily consist of information-seeking queries (e.g., aggregated questions from search engines) where keyword or semantic-based retrieval is usually sufficient. However, many complex real-world queries require in-depth reasoning to identify relevant documents that go beyond surface form matching. For example, finding documentation for a coding question requires understanding the logic and syntax of the functions involved. To better benchmark retrieval on such challenging queries, we introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. Our dataset consists of 1,384 real-world queries spanning diverse domains, such as economics, psychology, mathematics, and coding. These queries are drawn from naturally occurring and carefully curated human data. Extensive evaluation reveals that even state-of-the-art retrieval models perform poorly on BRIGHT. The leading model on the MTEB leaderboard (Muennighoff et al., 2023), which achieves a score of 59.0 nDCG@10, scores only 18.3 nDCG@10 on BRIGHT. We show that incorporating explicit reasoning about the query improves retrieval performance by up to 12.2 points. Moreover, incorporating retrieved documents from the top-performing retriever boosts question-answering performance by over 6.6 points. We believe that BRIGHT paves the way for future research on retrieval systems in more realistic and challenging settings.

Computation and Language, Artificial Intelligence, Information Retrieval

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses a limitation of existing retrieval benchmarks (such as BEIR and MTEB): they consist primarily of information-seeking queries (e.g., aggregated questions from search engines), for which keyword or semantic matching is usually sufficient. Many complex real-world queries, however, require in-depth reasoning to identify relevant documents, not just surface-level matching. For example, finding documentation for a coding question requires understanding the logic and syntax of the functions involved.

To better evaluate retrieval on such queries, the authors introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. BRIGHT contains 1,384 real-world queries from diverse domains, including economics, psychology, mathematics, and coding. The queries are drawn from naturally occurring and carefully curated human data.

Extensive evaluation shows that even state-of-the-art retrieval models perform poorly on BRIGHT. The leading model on the MTEB leaderboard scores 59.0 nDCG@10 on MTEB but only 18.3 nDCG@10 on BRIGHT. Incorporating explicit reasoning about the query improves retrieval performance by up to 12.2 points, and supplying documents retrieved by the top-performing retriever boosts question-answering performance by over 6.6 points. Overall, BRIGHT paves the way for future research on retrieval systems in more realistic and challenging settings.
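The scores quoted above are nDCG@10 values. As a reminder of what that metric measures, here is a minimal Python sketch of nDCG@10 for a single query, using the common linear-gain, log2-discount formulation. This is an illustration only, not the BRIGHT evaluation code: the function names, the document ids, and the toy relevance labels are all assumptions made for the example, and the scaling by 100 assumes the reported figures are on a 0-100 scale.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    # Rank positions are 1-based, so position i (0-based) is discounted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevant, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    relevant: dict mapping document id -> relevance label (missing ids count as 0).
    """
    gains = [relevant.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the most relevant documents first.
    ideal = sorted(relevant.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical toy query: two relevant documents, retrieved at ranks 2 and 4.
ranking = ["d3", "d1", "d7", "d2"]
gold = {"d1": 1, "d2": 1}
score = ndcg_at_k(ranking, gold)
print(f"nDCG@10 = {100 * score:.1f}")
```

A benchmark-level number is typically the mean of this per-query value over all queries, so a low average indicates that relevant documents are rarely ranked near the top.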
