Leveraging Large Language Models for Enhanced NLP Task Performance through Knowledge Distillation and Optimized Training Strategies

Yining Huang, Keke Tang, Meilian Chen
2024-03-24
Abstract: Emerging Large Language Models (LLMs) like GPT-4 have revolutionized Natural Language Processing (NLP), showing potential in traditional tasks such as Named Entity Recognition (NER). Our study explores a three-phase training strategy that harnesses GPT-4's capabilities to enhance the BERT model's performance on NER. Initially, GPT-4, without fine-tuning, annotates a subset of the CONLL2003 dataset and an additional BBC dataset. We then train BERT using a mix of original and LLM-annotated data, analyzing the efficacy of LLM annotations against traditional methods. The second phase involves comparative experiments with different training regimens, assessing the synergy between distilled and original data. We observe that sequential strategies, particularly a simple mix that trains first on distilled data and then on original data, significantly boost performance. In the third phase, we investigate various data blending techniques, including sigmoid and power decay functions, to optimize the training process further. Our results indicate that a strategic mix of distilled and original data markedly elevates the NER capabilities of BERT. Our approach presents a scalable methodology that reduces manual annotation costs and increases efficiency, making it especially pertinent in resource-limited and closed-network environments. The study concludes that while the 'Simple Mix' strategy yields the best results, understanding its underlying mechanisms requires further research. Future work will also focus on refining prompt designs and enhancing annotation selection processes, aiming to extend our methodology to diverse NLP tasks.
Computation and Language
What problem does this paper attempt to address?
This paper aims to improve the BERT model's performance on named entity recognition (NER) by distilling knowledge from a large language model (GPT-4) and optimizing the training strategy. Specifically, the researchers explore a three-stage training strategy:

1. **Data Annotation Stage**:
   - Use GPT-4, without fine-tuning, to annotate a subset of the CONLL2003 and BBC News datasets.
   - Compare the quality of the GPT-4 annotations with that of traditional manual annotations.
2. **Model Training Stage**:
   - Train the BERT model on mixed data (original data plus LLM-annotated data) and evaluate the effectiveness of the LLM-annotated data.
   - Run comparative experiments across training strategies: distilled data only, original data only, and mixed data in different proportions.
3. **Data Fusion Technique Stage**:
   - Explore data fusion techniques, such as sigmoid and power decay functions, to further optimize the training process.
   - Analyze the impact of each data fusion technique on model performance.

Through these steps, the researchers aim to validate a scalable method that reduces manual annotation cost, improves efficiency, and remains applicable in resource-limited and closed-network environments. The results show that the simple sequential mixed-data strategy (training first on distilled data and then on original data) significantly improves the NER ability of the BERT model.
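The blending strategies named above can be pictured as schedules that set, for each epoch, what fraction of training samples comes from the GPT-4-annotated (distilled) set. The sketch below is illustrative only: this summary does not give the authors' functional forms, so the formulas, the midpoint switch used for the sequential strategy, and the names `distilled_fraction`, `sample_epoch`, `steepness`, and `power` are all assumptions.

```python
import math
import random


def distilled_fraction(epoch, total_epochs, schedule="sigmoid",
                       steepness=10.0, power=2.0):
    """Fraction of an epoch's samples drawn from distilled data.

    All three schedules are assumed forms, not the paper's definitions.
    """
    # Training progress in [0, 1]; guard against a single-epoch run.
    t = epoch / max(total_epochs - 1, 1)
    if schedule == "simple_mix":
        # Sequential strategy: distilled data first, original data afterwards.
        # Switching at the midpoint is an assumption.
        return 1.0 if t < 0.5 else 0.0
    if schedule == "sigmoid":
        # Smooth hand-over centred on the midpoint of training.
        return 1.0 / (1.0 + math.exp(steepness * (t - 0.5)))
    if schedule == "power":
        # Polynomial decay from 1 at the start to 0 at the end.
        return (1.0 - t) ** power
    raise ValueError(f"unknown schedule: {schedule}")


def sample_epoch(distilled, original, fraction, size, rng):
    """Draw `size` examples (with replacement), `fraction` of them distilled."""
    n_distilled = round(fraction * size)
    batch = rng.choices(distilled, k=n_distilled)
    batch += rng.choices(original, k=size - n_distilled)
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    distilled = [("distilled", i) for i in range(100)]
    original = [("original", i) for i in range(100)]
    for epoch in range(5):
        frac = distilled_fraction(epoch, 5, schedule="power")
        batch = sample_epoch(distilled, original, frac, size=8, rng=rng)
        n = sum(1 for source, _ in batch if source == "distilled")
        print(f"epoch {epoch}: distilled fraction {frac:.2f}, {n}/8 distilled")
```

In an actual training loop, the sampled examples would be fed to a BERT token-classification trainer. Under these assumed forms, the sequential strategy is simply the limiting case of the sigmoid schedule as `steepness` grows large.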