IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus

Honghao Gui, Lin Yuan, Hongbin Ye, Ningyu Zhang, Mengshu Sun, Lei Liang, Huajun Chen
2024-04-08
Abstract: Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). Note that high-quality instruction data is the vital key for enhancing the specific capabilities of LLMs, while current IE datasets tend to be small in scale, fragmented, and lack standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus. Experimental results on LLaMA, Baichuan and Qwen demonstrate that using IEPile can enhance the performance of LLMs for IE, especially the zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.
Computation and Language, Artificial Intelligence, Databases, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
This paper addresses the significant performance gap that large language models (LLMs) show on information extraction (IE) tasks. Although LLMs perform well on many natural language processing (NLP) tasks, their IE performance remains unsatisfactory because high-quality, large-scale instruction data is lacking. Specifically, existing IE datasets are usually small, fragmented, and lack standardized schemas, which limits how much LLMs can improve on IE tasks. To overcome these problems, the authors introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus containing approximately 0.32B (about 320 million) tokens. The corpus was constructed by collecting and cleaning 33 existing IE datasets and applying a schema-based instruction generation strategy. Experimental results show that training with IEPile improves the IE performance of LLMs, especially their zero-shot generalization.

### Main contributions:

1. **Constructing a large-scale IE corpus**: IEPile is a bilingual IE instruction corpus of approximately 0.32B tokens, intended to provide high-quality training data that strengthens the IE ability of LLMs.
2. **Schema-based instruction generation strategy**: A schema-based instruction generation method is introduced to address two weaknesses of existing methods: inconsistent schema queries and semantic confusion.
3. **Experimental verification**: Experiments on LLaMA, Baichuan, and Qwen demonstrate the effectiveness of IEPile, especially the improvement in zero-shot generalization.
### Specific problems solved:

- **Inconsistent schema queries**: When the number of schemas queried per instruction differs between training and evaluation, the model's generalization performance declines. The authors address this with a batched instruction generation method that dynamically limits the number of schemas queried in each instruction.
- **Semantic confusion**: Semantically similar schemas that appear together in an instruction can confuse the model. The authors build a hard negative schema dictionary and use it to make semantically similar schemas co-occur more often in training instructions, which improves the model's robustness.

In conclusion, the paper improves the performance of LLMs on IE tasks, especially zero-shot generalization, by constructing a high-quality, large-scale IE corpus and an improved instruction generation strategy.
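
The batching and hard-negative steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_instructions`, the parameters `split_num` and `extra_negatives`, the instruction wording, and the JSON field names are all hypothetical choices made for this example.

```python
import json
import random


def build_instructions(text, positive, hard_negatives, all_schemas,
                       split_num=4, extra_negatives=1, seed=0):
    """Split the schemas queried for one sample into fixed-size batches.

    positive:       schema labels that actually occur in `text`
    hard_negatives: dict mapping a label to semantically similar labels
    All names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    pos = list(dict.fromkeys(positive))  # de-duplicate, keep order

    # Hard negatives: labels similar to a positive one but absent from the text.
    hard = []
    for label in pos:
        for cand in hard_negatives.get(label, []):
            if cand not in pos and cand not in hard:
                hard.append(cand)

    # A few ordinary negatives drawn from the remaining schema set.
    rest = [s for s in all_schemas if s not in pos and s not in hard]
    other = rng.sample(rest, min(extra_negatives, len(rest)))

    queried = pos + hard + other
    rng.shuffle(queried)

    # Cap every instruction at `split_num` schemas.
    batches = [queried[i:i + split_num]
               for i in range(0, len(queried), split_num)]
    return [
        json.dumps({
            "instruction": "Extract the listed relation types from the input.",
            "schema": batch,
            "input": text,
        }, ensure_ascii=False)
        for batch in batches
    ]


if __name__ == "__main__":
    # Hypothetical schema set and hard negative dictionary.
    hard_negatives = {"place of birth": ["place of death", "residence"]}
    all_schemas = ["place of birth", "place of death", "residence",
                   "employer", "spouse", "author"]
    for line in build_instructions("Marie Curie was born in Warsaw.",
                                   ["place of birth"], hard_negatives,
                                   all_schemas, split_num=2):
        print(line)
```

Because the same `split_num` cap applies to every instruction, the number of schemas per query stays bounded at both training and evaluation time. Adding the hard negatives to the queried set before batching is what makes similar labels such as "place of birth" and "place of death" likely to be asked about for the same input.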