Learning to Extract Structured Entities Using Language Models

Haolun Wu, Ye Yuan, Liana Mikaelyan, Alexander Meulemans, Xue Liu, James Hensman, Bhaskar Mitra
2024-10-02
Abstract: Recent advances in machine learning have significantly impacted the field of information extraction, with Language Models (LMs) playing a pivotal role in extracting structured information from unstructured text. Prior works typically represent information extraction as triplet-centric and use classical metrics such as precision and recall for evaluation. We reformulate the task to be entity-centric, enabling the use of diverse metrics that can provide more insights from various perspectives. We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP (AESOP) metric, designed to appropriately assess model performance. Later, we introduce a new Multistage Structured Entity Extraction (MuSEE) model that harnesses the power of LMs for enhanced effectiveness and efficiency by decomposing the extraction task into multiple stages. Quantitative and human side-by-side evaluations confirm that our model outperforms baselines, offering promising directions for future advancements in structured entity extraction. Our source code and datasets are available at https://github.com/microsoft/Structured-Entity-Extraction.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address several key issues in Information Extraction (IE):

1. **Limitations of Existing Methods**: Prior information extraction methods mainly focus on extracting triples (⟨subject, relation, object⟩) and evaluate them with classical metrics such as Precision and Recall. Triple-level scores fall short in assessing the model's overall understanding of the text, especially when one entity is associated with many relations while other entities are associated with only a few.
2. **Entity-Level Evaluation**: Existing evaluation methods often overlook entity-level normalization, particularly when multiple entities share the same name or lack unique identifiers. This leads to inaccurate and misleading evaluation results.
3. **Structured Entity Extraction**: The paper proposes a new task format, Structured Entity Extraction, which redefines information extraction as an entity-centric task rather than a triple-centric one. This format allows the use of diverse evaluation metrics that provide insights from different perspectives.
4. **Improvement of Evaluation Metrics**: To better evaluate model performance on the structured entity extraction task, the paper introduces a new evaluation metric, Approximate Entity Set OverlaP (AESOP). This metric is more flexible and can include different levels of normalization, giving a more comprehensive assessment of model performance.
5. **Innovation in Model Architecture**: The paper proposes a new Multistage Structured Entity Extraction (MuSEE) model that builds on Language Models (LMs). Decomposing the extraction task into multiple stages improves both effectiveness and efficiency. Work within each stage can be processed in parallel, and the number of generated tokens is reduced, which further improves training and inference efficiency.
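
The difference between the triple-centric and entity-centric views in items 1 and 3 can be illustrated with a minimal sketch. The function name `triples_to_entities`, the record layout (`name` plus a `properties` mapping), and the toy data are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
from collections import defaultdict


def triples_to_entities(triples):
    """Group (subject, relation, object) triples by subject.

    Hypothetical helper: returns one record per subject, each holding
    every relation observed for that subject.
    """
    grouped = defaultdict(dict)
    for subject, relation, obj in triples:
        # A relation may have several objects, so collect them in a list.
        grouped[subject].setdefault(relation, []).append(obj)
    return [
        {"name": subject, "properties": dict(props)}
        for subject, props in grouped.items()
    ]


# Made-up toy data: one entity with many relations, one with a single relation.
triples = [
    ("Alice", "occupation", "engineer"),
    ("Alice", "city", "Paris"),
    ("Alice", "colleague", "Bob"),
    ("Bob", "occupation", "chef"),
]

entities = triples_to_entities(triples)
for entity in entities:
    print(entity)
```

In the triple view, "Alice" contributes three of the four scored items, so errors on that one entity dominate precision and recall. In the grouped view each entity is a single unit, which is what makes entity-level comparison possible.
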
### Summary

The main contributions of the paper include:

- Proposing an entity-centric information extraction task format called structured entity extraction.
- Introducing a new evaluation metric, AESOP, for a more comprehensive assessment of the model's performance in structured entity extraction tasks.
- Designing a new model architecture, MuSEE, which improves the model's effectiveness and efficiency through multi-stage parallel generation and reduced output token count.

These innovations address the shortcomings of existing information extraction methods in evaluation and practical application, providing new directions for future research in structured entity extraction.
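
To give a feel for what an entity-level overlap score might look like, here is a rough sketch. It is not the AESOP definition, which this summary does not spell out. It assumes that entities are flat dictionaries of string values, that two entities are compared by the Jaccard overlap of their (property, value) pairs, that predicted and target entities are paired greedily, and that the total is normalized by the larger set size. The names `entity_similarity` and `approximate_set_overlap` are hypothetical.

```python
def entity_similarity(a, b):
    """Jaccard overlap of (property, value) pairs; an assumed similarity."""
    pairs_a, pairs_b = set(a.items()), set(b.items())
    if not pairs_a and not pairs_b:
        return 1.0
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)


def approximate_set_overlap(predicted, target):
    """Greedy one-to-one matching of entities, averaged over the larger set.

    Illustrative only: the matching strategy and the normalization are
    assumptions, not the paper's metric.
    """
    if not predicted and not target:
        return 1.0
    # Score every (predicted, target) pair, best pairs first.
    candidates = sorted(
        (
            (entity_similarity(p, t), i, j)
            for i, p in enumerate(predicted)
            for j, t in enumerate(target)
        ),
        reverse=True,
    )
    used_pred, used_target, total = set(), set(), 0.0
    for score, i, j in candidates:
        if i in used_pred or j in used_target:
            continue  # each entity may be matched at most once
        used_pred.add(i)
        used_target.add(j)
        total += score
    # Unmatched entities on either side lower the score via the denominator.
    return total / max(len(predicted), len(target))


# Made-up toy data: order differs, and one property value is wrong.
target = [
    {"name": "Alice", "occupation": "engineer", "city": "Paris"},
    {"name": "Bob", "occupation": "chef"},
]
predicted = [
    {"name": "Bob", "occupation": "chef"},
    {"name": "Alice", "occupation": "engineer", "city": "Lyon"},
]

score = approximate_set_overlap(predicted, target)
print(score)
```

Because entities are matched as whole units before scoring, the result does not depend on output order, and an entity with many properties does not outweigh one with few.
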