AlignXIE: Improving Multilingual Information Extraction by Cross-Lingual Alignment

Yuxin Zuo, Wenxuan Jiang, Wenxuan Liu, Zixuan Li, Long Bai, Hanbin Wang, Yutao Zeng, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
2024-11-07
Abstract: Empirical evidence suggests that LLMs exhibit spontaneous cross-lingual alignment. Our findings suggest that although LLMs also demonstrate promising cross-lingual alignment in Information Extraction (IE), there remains a significant imbalance across languages, revealing an underlying deficiency in IE alignment. To address this issue, we propose AlignXIE, a powerful code-based LLM that significantly enhances cross-lingual IE alignment through two strategies. First, AlignXIE formulates IE across different languages, especially non-English ones, as code generation tasks, standardizing the representation of various schemas using Python classes to ensure consistency of the same ontology in different languages and to align the schema. Second, it incorporates an IE cross-lingual alignment phase through a translated instance prediction task proposed in this paper to align the extraction process. This phase uses ParallelNER, an IE bilingual parallel dataset with 257,190 samples, generated by our proposed LLM-based automatic pipeline for IE parallel data construction, with manual annotation to ensure quality. Ultimately, we obtain AlignXIE through multilingual IE instruction tuning. Even without training on 9 unseen languages, AlignXIE surpasses ChatGPT by $30.17\%$ and SoTA by $20.03\%$, thereby demonstrating superior cross-lingual IE capabilities. Comprehensive evaluations on 63 IE benchmarks in Chinese and English under various settings demonstrate that AlignXIE significantly enhances cross-lingual and multilingual IE by boosting IE alignment.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

The paper addresses cross-lingual alignment in multilingual Information Extraction (IE). Although Large Language Models (LLMs) exhibit some spontaneous cross-lingual alignment in IE tasks, that alignment remains significantly imbalanced across languages, especially for non-English languages. This imbalance points to an underlying deficiency in cross-lingual alignment for IE.

Specifically, the paper raises two main issues:

1. **Imbalance in Cross-Lingual Alignment**: The alignment LLMs show in IE tasks is uneven across languages and is weakest for non-English languages.
2. **Performance Gap in Cross-Lingual Information Extraction**: IE performance differs significantly between languages, indicating that existing cross-lingual alignment methods perform poorly in some of them.

To address these issues, the paper proposes AlignXIE, which enhances cross-lingual alignment in IE through two strategies:

1. **Unified Code Generation Framework**: Formulates IE tasks in different languages as code generation tasks, using Python classes to represent the various schemas so that the same ontology stays consistent across languages.
2. **Cross-Lingual Alignment Phase**: Aligns the extraction process through a translated instance prediction task, trained on the ParallelNER bilingual parallel dataset.

Ultimately, AlignXIE is obtained through multilingual IE instruction tuning. According to the paper, it demonstrates significantly stronger cross-lingual IE capabilities than existing methods across multiple benchmarks.
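
To make the idea of representing schemas as Python classes more concrete, the following is a minimal sketch, not the authors' actual schema or prompt format. The class names (`Entity`, `Person`, `Organization`), the helper `schema_types`, and the example sentence pair are all hypothetical and chosen only for illustration. The sketch shows one ontology, defined once as classes, being instantiated by extraction results from an English sentence and its Chinese translation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Base class for all entity types in this illustrative schema."""
    name: str


@dataclass(frozen=True)
class Person(Entity):
    """A named individual (hypothetical schema class)."""


@dataclass(frozen=True)
class Organization(Entity):
    """A company, institution, or other organized group (hypothetical)."""


def schema_types(results):
    """Return the sorted class names used in a list of extracted entities."""
    return sorted({type(e).__name__ for e in results})


# Hypothetical extraction outputs for a parallel sentence pair.
# English: "Tim Cook is the CEO of Apple."
results_en = [Person(name="Tim Cook"), Organization(name="Apple")]
# Chinese: "蒂姆·库克是苹果公司的首席执行官。"
results_zh = [Person(name="蒂姆·库克"), Organization(name="苹果公司")]

# Both languages instantiate the same schema classes; only the mention
# strings differ.
print(schema_types(results_en) == schema_types(results_zh))
```

Under these assumptions, the language of the input text affects only the mention strings, while the type labels come from a single shared class definition. This is the sense in which a code-based schema keeps the ontology consistent across languages.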