Abstract: Methods for relation extraction from text mostly focus on high precision, at the cost of limited recall. High recall is crucial, though, to populate long lists of object entities that stand in a specific relation with a given subject. Cues for relevant objects can be spread across many passages in long texts. This poses the challenge of extracting long lists from long texts. We present the L3X method which tackles the problem in two stages: (1) recall-oriented generation using a large language model (LLM) with judicious techniques for retrieval augmentation, and (2) precision-oriented scrutinization to validate or prune candidates. Our L3X method outperforms LLM-only generations by a substantial margin.
### What problem does this paper attempt to address?
This paper addresses a challenge in Information Extraction (IE): extracting a long list of object entities that stand in a given relation with a given subject, where the evidence is spread across a long text. Existing relation extraction methods usually optimize for high precision and achieve only limited recall, which restricts their usefulness for populating long object lists. Examples include extracting all friends of a character from the Harry Potter book series, or all subsidiaries of Alphabet Inc. from an entire website.
### Main contributions of the paper:
1. **New task**: Propose the task of extracting a long list of objects from a long document (such as a book), given a subject and a relation.
2. **Methodology**: Develop a method named L3X, which combines a retrieval-augmented large language model (LLM) with information retrieval techniques and works in two stages:
   - **Stage 1: Recall-oriented generation**: Use the LLM to generate an object list, and improve recall through retrieval-augmentation techniques such as re-ranking and batching.
   - **Stage 2: Precision-oriented scrutinization**: Validate or prune the object candidates generated in the first stage to improve precision.
3. **Experimental results**: Construct a new dataset containing 10 books and 8 relations, and show that L3X substantially outperforms the LLM-only baseline in the experiments.
### Key technical details:
- **Recall-oriented generation**:
  - **Direct prompting**: Prompt the LLM with the book title, subject, and relation to generate an initial object list.
  - **Retrieval**: Use sparse or dense retrievers to retrieve a large number of relevant passages from the long text.
  - **Re-ranking**: Re-rank passages by criteria such as named-entity mention frequency, diversity of the selection, and pseudo-relevance feedback.
  - **Batching**: Group passages with similar entity mentions or narratives so that the LLM receives semantically coherent inputs.
  - **Iteration**: Analyze the passage pool, re-prioritize, and repeat the steps above to improve recall.
- **Precision-oriented scrutinization**:
  - **Evidence retrieval**: Search the entire document for text snippets that support each subject-predicate-object (SPO) triple.
  - **Classifiers**: Design multiple classifiers, including score-based thresholding, confidence elicitation, predicate-specific classifiers, and discriminative classifiers, to validate or prune object candidates.
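The re-ranking, batching, and threshold-based pruning steps listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`entity_mentions`, `rerank_by_mentions`, `batch_by_dominant_entity`, `prune_by_score`) are hypothetical, the capitalized-word heuristic is only a crude stand-in for a real named-entity recognizer, and the candidate scores are assumed to come from some upstream scoring step.

```python
import re
from collections import Counter

# Hypothetical stopword list so sentence-initial words are not counted as entities.
STOP = {"The", "A", "An", "He", "She", "They", "It"}

def entity_mentions(passage: str) -> list[str]:
    """Crude stand-in for NER: capitalized words that are not in STOP."""
    return [t for t in re.findall(r"\b[A-Z][a-z]+\b", passage) if t not in STOP]

def rerank_by_mentions(passages: list[str]) -> list[str]:
    """Put passages with more entity mentions first (ties keep retrieval order)."""
    return sorted(passages, key=lambda p: len(entity_mentions(p)), reverse=True)

def batch_by_dominant_entity(passages: list[str]) -> list[list[str]]:
    """Group passages that share the same most frequently mentioned entity."""
    batches: dict[str | None, list[str]] = {}
    for p in passages:
        counts = Counter(entity_mentions(p))
        key = counts.most_common(1)[0][0] if counts else None
        batches.setdefault(key, []).append(p)
    return list(batches.values())

def prune_by_score(candidates: dict[str, float], threshold: float) -> list[str]:
    """Score-based thresholding: keep candidates whose score reaches the threshold."""
    return [obj for obj, score in candidates.items() if score >= threshold]

if __name__ == "__main__":
    passages = [
        "It rained all night over the castle.",
        "Harry met Ron and Hermione on the train.",
        "Ron laughed while Ron's brother Fred teased Harry.",
        "Ron and Ron's rat slept.",
    ]
    for batch in batch_by_dominant_entity(rerank_by_mentions(passages)):
        print(batch)
    print(prune_by_score({"Ron": 0.9, "Hermione": 0.7, "Dobby": 0.2}, 0.5))
```

In a full pipeline, each batch would be placed into an LLM prompt together with the subject and relation, and the threshold would be tuned on held-out data; both of those steps are outside this sketch.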
### Experimental setup and evaluation metrics:
- **Dataset**: Construct a new dataset containing 10 popular novels and entire book series, with a total of about 16,000 pages, covering about 4,000 entities and about 7,300 aliases.
- **Evaluation metrics**: Mainly Recall@Precision (R@P), in particular R@P50 and R@P80 (recall at a precision of 50% and 80%, respectively), alongside absolute precision and recall.
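To make the R@P metric concrete, here is a small sketch under one common reading: given a score-ranked candidate list, R@P is the highest recall reached at any cutoff whose precision is at least the target. This summary does not spell out the exact definition used in the paper, so treat both the function `recall_at_precision` and this reading as assumptions. The example entities are made up for illustration.

```python
def recall_at_precision(ranked: list[str], gold: set[str], min_precision: float) -> float:
    """Best recall over all prefixes of `ranked` whose precision >= min_precision.

    Assumes `ranked` is sorted by descending confidence, contains no
    duplicates, and that `gold` is non-empty.
    """
    best_recall = 0.0
    hits = 0
    for k, obj in enumerate(ranked, start=1):
        if obj in gold:
            hits += 1
        precision = hits / k
        if precision >= min_precision:
            best_recall = max(best_recall, hits / len(gold))
    return best_recall

if __name__ == "__main__":
    ranked = ["Ron", "Hermione", "Draco", "Neville", "Dobby"]
    gold = {"Ron", "Hermione", "Neville", "Luna"}
    print(recall_at_precision(ranked, gold, 0.5))  # R@P50 -> 0.75
    print(recall_at_precision(ranked, gold, 0.8))  # R@P80 -> 0.5
```

In the toy example, the top-4 prefix contains three of the four gold objects at 75% precision, so R@P50 is 0.75. Only the top-1 and top-2 prefixes reach 80% precision, so R@P80 is 0.5.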
### Main findings:
- **Recall improvement**: In the first stage, L3X raises recall to as much as 75%, compared with about 50% for the LLM-only baseline.
- **Precision optimization**: The precision-oriented scrutinization in the second stage improves precision while keeping recall high, with particularly strong results on the R@P50 and R@P80 metrics.
In conclusion, this paper proposes an effective method to solve the challenge of extracting long object lists from long documents, providing new ideas and technical means for research in the field of information extraction.