What problem does this paper attempt to address?

This paper addresses a challenge in Information Extraction (IE): how to extract a long list of object entities related to a given topic from long documents. Existing relation extraction methods usually favor high precision at the cost of low recall, which limits their usefulness for generating long object lists. Examples include extracting all the friends of characters from the Harry Potter series of books, or all subsidiaries of Alphabet Inc. from an entire website.

### Main contributions of the paper:

1. **New task**: Extract a long list of objects from long documents (such as books), given a topic and a relation.
2. **Methodology**: A method named L3X, which combines retrieval-augmented large language models (LLMs) with information retrieval techniques and works in two stages:
   - **Stage 1: Recall-oriented generation**: Use an LLM to generate an object list, and improve recall through retrieval-augmentation techniques such as re-ranking and batching.
   - **Stage 2: Precision-oriented review**: Verify and prune the object candidates generated in the first stage to improve precision.
3. **Experimental results**: A new dataset containing 10 books and 8 relations, on which L3X shows significant advantages over the LLM-only baseline.

### Key technical details:

- **Recall-oriented generation**:
  - **Direct prompting**: Use the book title, topic, and relation as inputs to prompt the LLM directly for an initial object list.
  - **Retrieval**: Use sparse or dense retrievers to retrieve a large number of relevant passages from the long text.
  - **Re-ranking**: Re-rank passages using methods such as named-entity mention frequency, diverse selection, and pseudo-relevance feedback.
  - **Batching**: Group passages with similar entity mentions or narratives to give the LLM semantically coherent inputs.
  - **Iteration**: Analyze the passage pool, re-prioritize, and repeat the steps above to improve recall.
- **Precision-oriented review**:
  - **Evidence retrieval**: Search the entire document for text fragments that support each SPO (subject-predicate-object) triple.
  - **Classifier**: Design multiple classifiers, including score-based thresholds, confidence extraction, predicate-specific classifiers, and discriminative classifiers, to verify and prune object candidates.

### Experimental setup and evaluation metrics:

- **Dataset**: A new dataset of 10 popular novels and entire book series, totaling about 16,000 pages and covering about 4,000 entities and about 7,300 aliases.
- **Evaluation metrics**: The main focus is Recall@Precision (R@P), especially R@P50 and R@P80, alongside other absolute precision and recall metrics.

### Main findings:

- **Recall improvement**: L3X significantly improves recall in the first stage, reaching up to 75%, while the LLM-only baseline reaches about 50%.
- **Precision optimization**: Through the precision-oriented review in the second stage, L3X improves precision while maintaining high recall, and performs especially well on the R@P50 and R@P80 metrics.

In conclusion, the paper proposes an effective method for extracting long object lists from long documents, offering new ideas and techniques for information extraction research.
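To make the Recall@Precision (R@P) metric concrete, here is a minimal Python sketch. It assumes R@P means the highest recall reached at any cutoff of a score-ranked candidate list whose precision is still at least the target (0.5 for R@P50, 0.8 for R@P80); the paper's exact definition may differ. The function name `recall_at_precision` and the toy candidates are hypothetical illustrations, not taken from the paper's code or dataset.

```python
from typing import Sequence, Set


def recall_at_precision(ranked: Sequence[str], gold: Set[str],
                        min_precision: float) -> float:
    """Best recall over all prefixes of `ranked` with precision >= min_precision.

    `ranked` is assumed to be sorted by descending confidence and to contain
    no duplicates; `gold` is the set of correct objects.
    """
    if not gold:
        return 0.0
    best_recall = 0.0
    hits = 0
    for k, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            hits += 1
        precision = hits / k
        if precision >= min_precision:
            best_recall = max(best_recall, hits / len(gold))
    return best_recall


# Toy example (made-up ranking and gold set, for illustration only).
ranked = ["Ron", "Hermione", "Draco", "Neville", "Dobby", "Voldemort"]
gold = {"Ron", "Hermione", "Neville", "Luna"}

r_at_p50 = recall_at_precision(ranked, gold, 0.5)
r_at_p80 = recall_at_precision(ranked, gold, 0.8)
print(f"R@P50 = {r_at_p50:.2f}, R@P80 = {r_at_p80:.2f}")
```

Under this reading, a stricter precision target can only lower the reported recall, which is why R@P80 is the harder of the two numbers to improve.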

Recall Them All: Retrieval-Augmented Language Models for Long Object List Extraction from Long Documents
