An Effective Incorporating Heterogeneous Knowledge Curriculum Learning for Sequence Labeling

Xuemei Tang, Qi Su
2024-02-21
Abstract: Sequence labeling models often benefit from incorporating external knowledge. However, this practice introduces data heterogeneity and complicates the model with additional modules, leading to increased expenses for training a high-performing model. To address this challenge, we propose a two-stage curriculum learning (TCL) framework specifically designed for sequence labeling tasks. The TCL framework enhances training by gradually introducing data instances from easy to hard, aiming to improve both performance and training speed. Furthermore, we explore different metrics for assessing the difficulty levels of sequence labeling tasks. Through extensive experimentation on six Chinese word segmentation (CWS) and part-of-speech (POS) tagging datasets, we demonstrate the effectiveness of our model in enhancing the performance of sequence labeling models. Additionally, our analysis indicates that TCL accelerates training and alleviates the slow training problem associated with complex models.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses two linked problems in sequence labeling: how to integrate heterogeneous external knowledge effectively, and how to contain the training cost that this integration brings. Existing methods improve model performance by introducing external knowledge such as n-grams, dictionaries, and syntactic information, but they also make the data more heterogeneous and the model more complex, which increases training time and resource consumption.

To address this, the paper proposes a two-stage curriculum learning (TCL) framework designed specifically for sequence labeling tasks. In the first stage, data-level curriculum learning, a simple transfer teacher model is trained on all the data; it provides an initial difficulty ordering of the samples for the student model and helps the student warm up. In the second stage, model-level curriculum learning, the student model starts training on the subset selected by the teacher model, and the training subset is gradually expanded according to data difficulty and the current state of the student model.

The paper also explores several metrics for measuring the difficulty of sequence labeling samples: a pre-defined metric, sentence length, and three metrics based on model uncertainty, namely Top-N Least Confidence (TLC), Maximum Normalized Log Probability (MNLP), and Bayesian Uncertainty (BU).

Experimental results show that the TCL framework accelerates training, improves model performance, and can be applied to other sequence labeling models. In an evaluation on six Chinese word segmentation and part-of-speech tagging datasets, TCL performs well, particularly on large-scale datasets, which is consistent with the advantages of curriculum learning in handling data heterogeneity.
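The easy-to-hard idea summarized above can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the exact form of the TLC score, the square-root pacing schedule (a common choice in competence-based curriculum learning), and the initial competence of 0.2 are all assumptions made for illustration. Each sample gets a difficulty score, and at each training step the model sees only the easiest fraction of the data, a fraction that grows until it covers the whole set.

```python
import math

def length_difficulty(tokens):
    """Pre-defined metric: longer sentences are treated as harder."""
    return len(tokens)

def mnlp_difficulty(token_log_probs):
    """Uncertainty metric in the spirit of MNLP (assumed form).

    token_log_probs holds the log-probability of the predicted label at
    each position. The length-normalized sum is negated so that a larger
    value means a harder sample.
    """
    return -sum(token_log_probs) / len(token_log_probs)

def tlc_difficulty(token_confidences, n=2):
    """Uncertainty metric in the spirit of TLC (assumed form).

    Averages the n least confident token predictions and returns one
    minus that average, so a larger value means a harder sample.
    """
    lowest = sorted(token_confidences)[:n]
    return 1.0 - sum(lowest) / len(lowest)

def competence(step, total_steps, initial=0.2):
    """Hypothetical square-root pacing: fraction of data available at a step."""
    frac = min(1.0, step / total_steps)
    return min(1.0, math.sqrt(frac * (1 - initial ** 2) + initial ** 2))

def curriculum_subset(samples, difficulties, step, total_steps, initial=0.2):
    """Return the easiest samples allowed by the current competence."""
    order = sorted(range(len(samples)), key=lambda i: difficulties[i])
    k = max(1, math.ceil(competence(step, total_steps, initial) * len(samples)))
    return [samples[i] for i in order[:k]]

# Toy usage with character-level sentences and the length metric.
sentences = [["我", "爱", "北京"], ["天", "安", "门", "广", "场"], ["好"]]
difficulties = [length_difficulty(s) for s in sentences]

early = curriculum_subset(sentences, difficulties, step=0, total_steps=10)
late = curriculum_subset(sentences, difficulties, step=10, total_steps=10)
print(early)      # only the shortest sentence is available at the start
print(len(late))  # all three sentences are available at the end
```

In the framework described above, the difficulty scores would come from the teacher model in the first stage and be refreshed from the student model in the second; the sketch leaves that choice to the caller by taking the `difficulties` list as an input.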