Does the Order of Fine-tuning Matter and Why?

Qihong Chen, Jiawei Li, Hyunjae Suh, Lianghao Jiang, Zheng Zhou, Jingze Chen, Jiri Gesi, Iftekhar Ahmed
2024-10-04
Abstract: To improve the performance on a target task, researchers have fine-tuned language models with an intermediate task before the target task of interest. However, previous works have focused on the pre-trained language models and downstream tasks in Natural Language Processing (NLP) and considered only one intermediate task. The effect of fine-tuning multiple intermediate tasks and their ordering on target task performance has not been fully explored in Software Engineering. In this study, we perform the first empirical study on analyzing the impact of task ordering on target task performance. Experimental results show that there is an impact of task ordering on target task performance by up to 6% of performance gain and up to 4% of performance loss. To explain such an impact, we consider a variety of potential factors, including the characteristics of dataset (syntactic similarity and semantic similarity analysis, dataset size), model (probing task and attention analysis), and task (task affinity analysis). Our study provides Software Engineering researchers and practitioners with insights into the effect of task orderings and how to select the one that is cost-effective while achieving the best performance gain.
Software Engineering
What problem does this paper attempt to address?
The problem this paper addresses is: **Does the order of fine-tuning tasks affect the performance of the target task, and if so, why?** Specifically, the paper examines how different task orderings change final target-task performance when a model is fine-tuned on multiple intermediate tasks in software engineering (SE). Previous research has mainly focused on natural language processing (NLP) and considered only one intermediate task, while this paper is the first to systematically study the effect of multiple intermediate tasks and their ordering on SE tasks.

### Research Background and Problem Description

1. **Background**:
   - Advances in self-supervised learning have enabled large-scale language models to perform well on a wide range of natural language processing tasks.
   - To further improve performance on downstream tasks, researchers introduced intermediate-task fine-tuning: the model is fine-tuned on one or more intermediate tasks before the target task.
   - The order of the intermediate tasks can lead to different results and can even cause negative transfer, where intermediate fine-tuning lowers target-task performance instead of raising it.
2. **Problem Description**:
   - Source code has characteristics that natural language lacks (such as syntactic constraints and semantic structure), so findings from NLP cannot be applied directly to SE tasks.
   - The study explores how the fine-tuning order of multiple intermediate tasks affects target-task performance and analyzes the reasons behind the effect.

### Research Methods

To answer these questions, the paper carries out the following work:

1. **Experimental Design**:
   - Four SE tasks were selected: Code Clone Detection (CD), Defect Detection (DD), Code Repair (CR), and Code Translation (CT).
   - CodeBERT was used as the pre-trained model, and both single-intermediate-task and multiple-intermediate-task fine-tuning experiments were run on these tasks.
   - All possible task-order combinations were evaluated with 10-fold cross-validation, for a total of 60 fine-tuning-chain models.
2. **Data Analysis**:
   - **Dataset Feature Analysis**: covers syntactic similarity, semantic similarity, and dataset size.
     - **Syntactic Similarity Analysis**: the syntactic similarity between datasets was calculated by comparing keywords, operators, and identifiers in code fragments.
     - **Semantic Similarity Analysis**: vector representations of code fragments were generated with CodeBERT, and semantic similarity was computed as their cosine similarity.
     - **Dataset Size Analysis**: the effect of dataset size on target-task performance was explored.
   - **Task Feature Analysis**: the knowledge-transfer effect between tasks was measured with task affinity analysis.
3. **Results and Discussion**:
   - The experimental results show that task ordering does affect target-task performance, with gains of up to 6% and losses of up to 4%.
   - The paper further analyzes the factors that may cause this effect, including dataset features, task features, and model characteristics.

### Conclusions

The main contributions of this paper are:

1. It is the first systematic study of how the fine-tuning order of multiple intermediate tasks affects performance on software engineering target tasks.
2. Through multi-dimensional analysis, it explains why task order affects performance.
3. It provides practical suggestions for choosing the most cost-effective task order to achieve the best performance gain.

These findings give software engineering researchers and practitioners a reference for improving model performance by choosing a sensible order of fine-tuning tasks.
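
The summary does not spell out how the 60 fine-tuning chains are counted. One reading that reproduces that number, offered here only as an assumption, is that each chain consists of one target task preceded by an ordered, non-empty selection of the other three tasks. The sketch below enumerates chains under that assumption; the function name `fine_tuning_chains` is hypothetical and not taken from the paper.

```python
from itertools import permutations

# Task abbreviations as given in the summary.
TASKS = ["CD", "DD", "CR", "CT"]


def fine_tuning_chains(tasks):
    """Yield (intermediate_order, target) pairs.

    Assumption (not stated in the source): a chain is one target task
    preceded by an ordered, non-empty selection of the remaining tasks.
    """
    for target in tasks:
        others = [t for t in tasks if t != target]
        # k = number of intermediate tasks placed before the target
        for k in range(1, len(others) + 1):
            for order in permutations(others, k):
                yield order, target


chains = list(fine_tuning_chains(TASKS))
print(len(chains))  # 60
for order, target in chains[:3]:
    print(" -> ".join(order + (target,)))
```

Per target task this gives 3 chains with one intermediate task, 6 with two, and 6 with three, so 15 chains per target and 60 across the four targets.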
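
The two dataset similarity analyses described under Data Analysis can be illustrated with a minimal sketch. The summary says only that syntactic similarity compares keywords, operators, and identifiers, and that semantic similarity is the cosine similarity of CodeBERT vectors. The Jaccard overlap, the tokenizing regular expression, and the helper names below are assumptions made for illustration, and the short vectors stand in for real CodeBERT embeddings.

```python
import math
import re

# Identifiers/keywords, or any single non-alphanumeric symbol.
# Numeric literals are ignored by this simplified tokenizer.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\sA-Za-z_0-9]")


def token_set(code):
    """Return the set of identifier, keyword, and symbol tokens in a snippet."""
    return set(TOKEN_RE.findall(code))


def jaccard(a, b):
    """Jaccard overlap of two token sets (an assumed syntactic measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


snippet_a = "int x = a + b;"
snippet_b = "int y = a - b;"
print(round(jaccard(token_set(snippet_a), token_set(snippet_b)), 3))  # 0.556

# Placeholder vectors; in the paper these would come from CodeBERT.
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

The two snippets share five of nine distinct tokens, which gives the 0.556 overlap. A dataset-level score would need some aggregation over many snippet pairs, which the summary does not describe.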