Abstract: Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). Note that high-quality instruction data is the vital key for enhancing the specific capabilities of LLMs, while current IE datasets tend to be small in scale, fragmented, and lack standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus. Experimental results on LLaMA, Baichuan and Qwen demonstrate that using IEPile can enhance the performance of LLMs for IE, especially the zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.

What problem does this paper attempt to address?

This paper addresses the significant performance gap that large language models (LLMs) show on information extraction (IE) tasks. Although LLMs perform well on many natural language processing (NLP) tasks, their IE performance remains unsatisfactory, largely because high-quality, large-scale instruction data is lacking. Specifically, existing IE datasets are usually small, fragmented, and lack standardized schemas, which limits how much LLMs can improve on IE. To overcome these problems, the authors introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus containing approximately 0.32B (320 million) tokens. The corpus was constructed by collecting and cleaning 33 existing IE datasets and applying a schema-based instruction generation strategy. Experimental results show that training on IEPile improves the IE performance of LLMs, especially their zero-shot generalization.

### Main contributions:

1. **Constructing a large-scale IE corpus**: IEPile is a bilingual IE instruction corpus containing approximately 0.32B tokens, intended to provide high-quality training data that strengthens the IE ability of LLMs.
2. **Schema-based instruction generation strategy**: A new schema-based instruction generation method addresses two problems in existing methods: inconsistent numbers of schema queries and semantic confusion.
3. **Experimental verification**: Experiments on LLaMA, Baichuan, and Qwen demonstrate the effectiveness of IEPile, especially the improvement in zero-shot generalization.
### Specific problems solved:

- **Inconsistent schema queries**: When the number of schemas queried per instruction differs between training and evaluation, the model's generalization performance can decline. The authors address this with a batched instruction generation method that dynamically limits the number of schemas queried in each instruction.
- **Semantic confusion**: Semantically similar schemas may appear together in an instruction and confuse the model. The authors construct a hard negative schema dictionary so that semantically similar schemas co-occur in instructions more often, which improves the robustness of the model.

In conclusion, the paper improves the performance of LLMs on IE tasks, especially zero-shot generalization, by constructing a high-quality, large-scale IE corpus and an improved instruction generation strategy.
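The two fixes described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `split_num` parameter, the instruction wording, and the choice to place each positive schema directly before its hard negatives (so they tend to fall into the same batch) are all assumptions made for illustration.

```python
import json
import random
from typing import Dict, List, Sequence


def build_schema_batches(
    positive: Sequence[str],
    all_schemas: Sequence[str],
    hard_negatives: Dict[str, List[str]],
    split_num: int = 4,
    num_random_negatives: int = 2,
    seed: int = 0,
) -> List[List[str]]:
    """Return batches of schemas, each holding at most `split_num` entries.

    Hypothetical helper: the ordering and sampling policy are assumptions.
    """
    if split_num < 1:
        raise ValueError("split_num must be at least 1")

    rng = random.Random(seed)
    ordered: List[str] = []
    seen = set()

    def add(schema: str) -> None:
        # Keep first occurrence only, preserving insertion order.
        if schema not in seen:
            seen.add(schema)
            ordered.append(schema)

    # Place each positive schema next to its semantically similar
    # (hard negative) schemas so they are likely to share a batch.
    for schema in positive:
        add(schema)
        for candidate in hard_negatives.get(schema, []):
            add(candidate)

    # Add a few other negatives sampled from the remaining schemas.
    remaining = [s for s in all_schemas if s not in seen]
    for schema in rng.sample(remaining, min(num_random_negatives, len(remaining))):
        add(schema)

    # Cap the number of schemas queried per instruction.
    return [ordered[i:i + split_num] for i in range(0, len(ordered), split_num)]


def build_instructions(text: str, batches: List[List[str]]) -> List[str]:
    """Serialize one instruction per batch (the JSON layout is illustrative)."""
    return [
        json.dumps(
            {
                "instruction": "Extract the relations listed in the schema from the input.",
                "schema": batch,
                "input": text,
            },
            ensure_ascii=False,
        )
        for batch in batches
    ]


# Toy data, invented for this example.
example_batches = build_schema_batches(
    positive=["place of birth"],
    all_schemas=["place of birth", "place of death", "employer", "spouse", "founder"],
    hard_negatives={"place of birth": ["place of death"]},
    split_num=2,
)
for line in build_instructions("Marie Curie was born in Warsaw.", example_batches):
    print(line)
```

Using the same `split_num` at training and evaluation time keeps the number of schemas per instruction consistent, which is the mismatch the first bullet describes.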

IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus
