Span-Oriented Information Extraction -- A Unifying Perspective on Information Extraction

Yifan Ding, Michael Yankoski, Tim Weninger
2024-03-19
Abstract: Information Extraction refers to a collection of tasks within Natural Language Processing (NLP) that identifies sub-sequences within text and their labels. These tasks have been used for many years to extract relevant information and to link free text to structured data. However, the heterogeneity among information extraction tasks impedes progress in this area. We therefore offer a unifying perspective centered on what we define to be spans in text. We then re-orient these seemingly incongruous tasks into this unified perspective and then re-present the wide assortment of information extraction tasks as variants of the same basic Span-Oriented Information Extraction task.
Computation and Language, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
This paper addresses information extraction in natural language processing (NLP). Information extraction tasks aim to identify subsequences (spans) in text and either assign them labels or link them to external structured knowledge. However, the heterogeneity among these tasks hinders progress in the field. The authors propose a unified view, centered on the span, in which the various information extraction tasks are understood as variants of one basic task: Span-Oriented Information Extraction.
The paper points out that real-world entities are often expressed as subsequences of several tokens rather than as single words. Information extraction involves identifying the surface forms of these entities and linking them to entries in knowledge bases or to other items in structured databases. Current methods, however, lack a principled way to represent spans, which leaves problems such as zero-shot entity linking difficult. The paper therefore introduces the concept of a span, which contains both the token subsequence that represents the entity (its surface form) and its meaning or label. Framed this way, many challenges in information extraction can be reconsidered, including how large language models (LLMs) can take into account the spans implicit in the tokens they generate.
The paper also reviews how spans have evolved over the history of information extraction, from early computational linguistics and NLP tasks such as parsing and part-of-speech tagging to later information extraction competitions such as MUC and ACE, which drove both the development of information extraction tasks and improvements in evaluation methods. Finally, the paper proposes a formal definition of spans and reorganizes the diversity of information extraction tasks along three aspects: span input/output, evaluation methods, and models.
The authors show how various information extraction tasks become more consistent when redefined as span prediction tasks. They also discuss how the different tasks are evaluated, including precision, recall, and F1 score, and how much flexibility to allow when matching predicted spans against reference spans. In conclusion, the paper aims to provide a unified perspective on information extraction that promotes progress in the field and gives a more consistent foundation for designing and evaluating information extraction systems.
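The span-level metrics mentioned above can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's own definition or evaluation code: the `Span` class (token offsets plus a label), the `span_prf` function, the greedy one-to-one matching, and the "relaxed" criterion (same label and any token overlap) are all hypothetical choices made here to show how exact and flexible matching can differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical span: token offsets [start, end) plus a label."""
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)
    label: str


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one token position."""
    return a.start < b.end and b.start < a.end


def span_prf(gold, pred, relaxed=False):
    """Return (precision, recall, F1) over predicted spans.

    Exact mode requires identical boundaries and label.
    Relaxed mode (an assumption for illustration) requires the same
    label and any token overlap. Each gold span is matched at most once.
    """
    def matches(p: Span, g: Span) -> bool:
        if p.label != g.label:
            return False
        if relaxed:
            return overlaps(p, g)
        return (p.start, p.end) == (g.start, g.end)

    unmatched = list(gold)
    true_positives = 0
    for p in pred:
        for g in unmatched:
            if matches(p, g):
                unmatched.remove(g)  # consume the gold span
                true_positives += 1
                break

    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Tokens: ["Barack", "Obama", "visited", "New", "York", "City"]
gold = [Span(0, 2, "PER"), Span(3, 6, "LOC")]
pred = [Span(0, 2, "PER"), Span(4, 6, "LOC")]  # second span misses "New"

exact_scores = span_prf(gold, pred)
relaxed_scores = span_prf(gold, pred, relaxed=True)
print(exact_scores)    # (0.5, 0.5, 0.5)
print(relaxed_scores)  # (1.0, 1.0, 1.0)
```

In this example the second prediction has the right label but the wrong left boundary, so it counts as an error under exact matching and as a hit under the relaxed criterion. The gap between the two scores is one way to see why the choice of matching rule matters when comparing systems across tasks.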