Empirical Insights on Fine-Tuning Large Language Models for Question-Answering

Junjie Ye, Yuming Yang, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan
2024-09-24
Abstract: Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets, which can then be fine-tuned for the question-answering (QA) task. However, effective strategies for fine-tuning LLMs for the QA task remain largely unexplored. To address this gap, we categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs and conduct a series of empirical analyses. Our experiments, involving four LLMs from three different model families, focus on three key factors: the amount of data required for SFT, the impact of different SFT datasets on model performance, and how data requirements vary across LLMs. The results show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task. Additionally, SFT with data of varying memory levels has a significant impact on LLM performance, with the optimal dataset differing based on the specific model being fine-tuned. Future research will delve deeper into the mechanisms underlying these phenomena.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the fact that effective strategies for fine-tuning large language models (LLMs) on question-answering (QA) tasks remain largely unexplored. It focuses on three questions:

1. **The amount of training data required for fine-tuning**: How much data is needed during the supervised fine-tuning (SFT) phase for an LLM to perform question answering effectively and generalize well?
2. **The impact of different fine-tuning datasets on LLM performance**: How does SFT data at different memorization levels affect performance? To study this, the authors classify both training and test data by how well the pre-trained model has memorized the underlying knowledge.
3. **Differences in data requirements among LLMs**: Do different LLMs need different data during fine-tuning, and how do these differences affect final QA performance?

To answer these questions, the authors ran a series of experiments on four LLMs from three model families. They measured how well each pre-trained LLM had memorized different types of knowledge using a multi-template complementary mechanism, then conducted detailed empirical analyses.
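The idea of grouping data by memorization level can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the substring-match correctness check, the pooling of responses across templates, and the five equal-width buckets are all assumptions made for illustration. The sketch only shows the general shape of scoring a fact by how often a pre-trained model answers it correctly across several prompt templates, then binning that score.

```python
from typing import Dict, List


def memorization_score(responses_by_template: Dict[str, List[str]], gold_answer: str) -> float:
    """Fraction of sampled responses, pooled over all templates, that contain the gold answer.

    Substring matching is a simplifying assumption, not the paper's metric.
    """
    gold = gold_answer.strip().lower()
    total = 0
    correct = 0
    for responses in responses_by_template.values():
        for response in responses:
            total += 1
            if gold in response.lower():
                correct += 1
    return correct / total if total else 0.0


def memory_level(score: float, num_levels: int = 5) -> int:
    """Map a score in [0, 1] to an equal-width bucket 0..num_levels-1 (hypothetical scheme)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return min(int(score * num_levels), num_levels - 1)


# Hypothetical model outputs for one fact, sampled under two prompt templates.
responses = {
    "template_a": ["Paris", "Paris, France", "Lyon"],
    "template_b": ["The capital is Paris.", "Marseille", "paris"],
}
score = memorization_score(responses, "Paris")
level = memory_level(score)
print(f"score={score:.2f}, level={level}")
```

In a setup like this, SFT subsets could be drawn from a single bucket at a time, which is one way to compare how data at different memorization levels affects the fine-tuned model.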