Abstract: Information Extraction refers to a collection of tasks within Natural Language Processing (NLP) that identify sub-sequences within text and their labels. These tasks have been used for many years to extract relevant information and to link free text to structured data. However, the heterogeneity among information extraction tasks impedes progress in this area. We therefore offer a unifying perspective centered on what we define to be spans in text. We then re-orient these seemingly incongruous tasks into this unified perspective and re-present the wide assortment of information extraction tasks as variants of the same basic Span-Oriented Information Extraction task.

What problem does this paper attempt to address?

This paper addresses information extraction in natural language processing (NLP). Information extraction tasks aim to identify subsequences (spans) in text and either assign them labels or link them to external structured knowledge. The heterogeneity among these tasks, however, hinders progress in the field. The authors therefore propose a unified, span-centered view in which the various information extraction tasks are recast as variants of one basic task, Span-Oriented Information Extraction.

The paper points out that real-world entities are often expressed by subsequences of multiple tokens rather than by single words. Information extraction involves identifying the surface forms of these entities and linking them to entries in knowledge bases or to other items in structured databases. Current methods, however, lack a principled approach to representing spans, so problems such as zero-shot entity linking remain difficult. The paper introduces a notion of span that contains both the token subsequence representing the entity (its surface form) and its meaning or label. Under this view, many challenges in information extraction can be reconsidered, and the capabilities of large language models (LLMs) can be used to better account for the spans inherent in generated tokens.

The paper reviews how spans have evolved over the history of information extraction, from early computational linguistics and NLP tasks such as parsing and part-of-speech tagging to later information extraction competitions such as MUC and ACE, which drove the development of information extraction tasks and improvements in evaluation methods. It then proposes a formal definition of spans and reframes the diversity of information extraction tasks along three aspects: span input/output, evaluation methods, and models.

The authors demonstrate how various information extraction tasks can be made more consistent by redefining them as span prediction tasks. They also discuss evaluation for different tasks, including precision, recall, and F1 score, as well as how to handle flexible matching. In conclusion, the paper aims to provide a unified perspective on information extraction that promotes progress in the field and offers a more consistent foundation for designing and evaluating information extraction systems.
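
To make the span view and its evaluation concrete, here is a minimal Python sketch. It is an illustration only, not the paper's formal definition or evaluation protocol: the `Span` class, the `surface` and `prf1` helpers, and the overlap-based "relaxed" criterion are all hypothetical names and choices assumed for this example. A span is modeled as a token range plus a label, and predictions are scored against gold spans with either exact or relaxed matching.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Span:
    """A labeled token subsequence: tokens[start:end] carries `label`.

    Hypothetical structure for illustration; not the paper's formal definition.
    """
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)
    label: str


def surface(tokens: List[str], span: Span) -> str:
    """Return the surface form of a span as a space-joined string."""
    return " ".join(tokens[span.start:span.end])


def overlaps(a: Span, b: Span) -> bool:
    """Relaxed match (an assumed criterion): same label and at least one shared token."""
    return a.label == b.label and a.start < b.end and b.start < a.end


def prf1(gold: List[Span], pred: List[Span], exact: bool = True) -> Tuple[float, float, float]:
    """Span-level precision, recall, and F1 under exact or relaxed matching."""
    match = (lambda a, b: a == b) if exact else overlaps
    # A prediction counts toward precision if it matches any gold span.
    pred_hits = sum(any(match(p, g) for g in gold) for p in pred)
    # A gold span counts toward recall if any prediction matches it.
    gold_hits = sum(any(match(g, p) for p in pred) for g in gold)
    precision = pred_hits / len(pred) if pred else 0.0
    recall = gold_hits / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


tokens = "Barack Obama visited New York City".split()
gold = [Span(0, 2, "PER"), Span(3, 6, "LOC")]
pred = [Span(0, 2, "PER"), Span(3, 5, "LOC")]  # second span is cut short

print(surface(tokens, gold[1]))        # New York City
print(prf1(gold, pred, exact=True))    # (0.5, 0.5, 0.5)
print(prf1(gold, pred, exact=False))   # (1.0, 1.0, 1.0)
```

The truncated prediction "New York" fails under exact matching but is credited under the relaxed criterion, which is the kind of trade-off that flexible matching has to settle.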

Span-Oriented Information Extraction -- A Unifying Perspective on Information Extraction

UniEX: An Effective and Efficient Framework for Unified Information Extraction via a Span-extractive Perspective

Entity, Relation, and Event Extraction with Contextualized Span Representations

Research on Information Extraction: A Survey

FSUIE: A Novel Fuzzy Span Mechanism for Universal Information Extraction

An Overview of Temporal Information Extraction

A span-based model for aspect terms extraction and aspect sentiment classification

UTC-IE: A Unified Token-pair Classification Architecture for Information Extraction

Span-based single-stage joint entity-relation extraction model

SpanRE: Entities and Overlapping Relations Extraction Based on Spans and Entity Attention

A Two-Phase Paradigm for Joint Entity-Relation Extraction

Enhanced Language Representation with Label Knowledge for Span Extraction

Extracting all Aspect-polarity Pairs Jointly in a Text with Relation Extraction Approach

Split-Correctness in Information Extraction

Unified Structure Generation for Universal Information Extraction

A framework for extraction and transformation of documents

Dealing with negative samples with multi-task learning on span-based joint entity-relation extraction

Jointly Learning Span Extraction and Sequence Labeling for Information Extraction from Business Documents

Span-based joint entity and relation extraction augmented with sequence tagging mechanism

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content