Abstract: Recent advances in machine learning have significantly impacted the field of information extraction, with Language Models (LMs) playing a pivotal role in extracting structured information from unstructured text. Prior works typically represent information extraction as triplet-centric and use classical metrics such as precision and recall for evaluation. We reformulate the task to be entity-centric, enabling the use of diverse metrics that can provide more insights from various perspectives. We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP (AESOP) metric, designed to appropriately assess model performance. Later, we introduce a new Multistage Structured Entity Extraction (MuSEE) model that harnesses the power of LMs for enhanced effectiveness and efficiency by decomposing the extraction task into multiple stages. Quantitative and human side-by-side evaluations confirm that our model outperforms baselines, offering promising directions for future advancements in structured entity extraction. Our source code and datasets are available at https://github.com/microsoft/Structured-Entity-Extraction.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address several key issues in Information Extraction (IE):

1. **Limitations of Existing Methods**: Traditional information extraction methods mainly focus on extracting triples (⟨subject, relation, object⟩) and use classical metrics such as Precision and Recall for evaluation. These methods fall short in assessing the model's overall understanding of the text, especially when one entity is associated with multiple relations while other entities are only associated with a few relations.
2. **Entity-Level Evaluation**: Existing evaluation methods often overlook entity-level normalization, particularly when dealing with multiple entities sharing the same name or lacking unique identifiers. This leads to inaccuracies and misleading evaluation results.
3. **Structured Entity Extraction**: The paper proposes a new task format, Structured Entity Extraction, which redefines information extraction as an entity-centric task rather than a triple-centric one. This new format allows for the use of diverse evaluation metrics, providing more insights from different perspectives.
4. **Improvement of Evaluation Metrics**: To better evaluate the model's performance in the structured entity extraction task, the paper introduces a new evaluation metric, Approximate Entity Set OverlaP (AESOP). This metric is more flexible and can include different levels of normalization, thus providing a more comprehensive assessment of the model's performance.
5. **Innovation in Model Architecture**: The paper proposes a new Multi-stage Structured Entity Extraction (MuSEE) model that leverages the advantages of Language Models (LMs). By decomposing the extraction task into multiple stages, the model improves effectiveness and efficiency. Each stage can be processed in parallel, reducing the number of generated tokens and further enhancing training and inference efficiency.
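
To make the contrast between triple-centric and entity-centric views more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's AESOP definition: the helper names (`triples_to_entities`, `property_jaccard`, `toy_entity_set_overlap`), the Jaccard similarity over properties, and the greedy one-to-one matching are all hypothetical choices made for this example. The sample entities are made up as well.

```python
from collections import defaultdict


def triples_to_entities(triples):
    """Group (subject, relation, object) triples into one property dict per subject.

    Assumption: each relation is single-valued; a repeated relation overwrites.
    """
    entities = defaultdict(dict)
    for subject, relation, obj in triples:
        entities[subject][relation] = obj
    return dict(entities)


def property_jaccard(pred, gold):
    """Jaccard overlap between the (key, value) pairs of two entities."""
    p, g = set(pred.items()), set(gold.items())
    if not p and not g:
        return 1.0
    return len(p & g) / len(p | g)


def toy_entity_set_overlap(pred_entities, gold_entities):
    """Toy entity-level score (NOT the paper's AESOP).

    Greedily matches predicted to gold entities one-to-one by similarity,
    then averages the matched similarities over the larger of the two sets,
    so missing and spurious entities both lower the score.
    """
    candidates = sorted(
        (
            (property_jaccard(p, g), i, j)
            for i, p in enumerate(pred_entities)
            for j, g in enumerate(gold_entities)
        ),
        reverse=True,
    )
    used_pred, used_gold, total = set(), set(), 0.0
    for score, i, j in candidates:
        if i not in used_pred and j not in used_gold:
            used_pred.add(i)
            used_gold.add(j)
            total += score
    denom = max(len(pred_entities), len(gold_entities))
    return total / denom if denom else 1.0


# Made-up example data.
triples = [
    ("Bill Gates", "type", "human"),
    ("Bill Gates", "occupation", "entrepreneur"),
    ("Microsoft", "type", "company"),
]
grouped = triples_to_entities(triples)

gold = [
    {"name": "Bill Gates", "type": "human", "occupation": "entrepreneur"},
    {"name": "Microsoft", "type": "company"},
]
pred = [
    {"name": "Bill Gates", "type": "human"},
    {"name": "Microsoft", "type": "company"},
]
score = toy_entity_set_overlap(pred, gold)
```

In this toy setup, a prediction that recovers both entities but misses one property of the first is penalized only at the level of that entity, whereas a triple-level precision/recall count would weight every relation equally regardless of which entity it belongs to.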
### Summary

The main contributions of the paper include:

- Proposing an entity-centric information extraction task format called structured entity extraction.
- Introducing a new evaluation metric, AESOP, for a more comprehensive assessment of the model's performance in structured entity extraction tasks.
- Designing a new model architecture, MuSEE, which improves the model's effectiveness and efficiency through multi-stage parallel generation and reduced output token count.

These innovations address the shortcomings of existing information extraction methods in evaluation and practical application, providing new directions for future research in structured entity extraction.

Learning to Extract Structured Entities Using Language Models
