Abstract: To improve the performance on a target task, researchers have fine-tuned language models with an intermediate task before the target task of interest. However, previous works have focused on the pre-trained language models and downstream tasks in Natural Language Processing (NLP) and considered only one intermediate task. The effect of fine-tuning multiple intermediate tasks and their ordering on target task performance has not been fully explored in Software Engineering. In this study, we perform the first empirical study on analyzing the impact of task ordering on target task performance. Experimental results show that there is an impact of task ordering on target task performance by up to 6% of performance gain and up to 4% of performance loss. To explain such an impact, we consider a variety of potential factors, including the characteristics of dataset (syntactic similarity and semantic similarity analysis, dataset size), model (probing task and attention analysis), and task (task affinity analysis). Our study provides Software Engineering researchers and practitioners with insights into the effect of task orderings and how to select the one that is cost-effective while achieving the best performance gain.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **Does the order of fine-tuning tasks affect the performance of the target task, and if so, why?** Specifically, the paper examines how different task orders affect final target-task performance when a model is fine-tuned through multiple intermediate tasks in software engineering (SE). Previous research has mainly focused on natural language processing (NLP) and considered only one intermediate task, whereas this paper is the first to systematically study the impact of multiple intermediate tasks and their order on software engineering tasks.

### Research Background and Problem Description

1. **Background**:
   - The development of self-supervised learning has enabled large-scale language models to perform well in various natural language processing tasks.
   - To further improve model performance on downstream tasks, researchers have introduced intermediate-task fine-tuning, that is, fine-tuning the model on one or more intermediate tasks before the target task.
   - However, the order of the intermediate tasks may lead to different performance results and may even produce negative transfer, where fine-tuning on intermediate tasks reduces the performance of the target task.
2. **Problem Description**:
   - In software engineering, source code has characteristics that differ from natural language (such as syntactic constraints and semantic structures), so findings from NLP cannot be directly applied to SE tasks.
   - This study aims to explore the impact of the fine-tuning order of multiple intermediate tasks on target-task performance and to analyze the reasons behind it.

### Research Methods

To answer the above questions, the paper carried out the following work:

1. **Experimental Design**:
   - Four SE tasks were selected: Code Clone Detection (CD), Defect Detection (DD), Code Repair (CR) and Code Translation (CT).
   - CodeBERT was used as the pre-trained model, and experiments on single-intermediate-task and multiple-intermediate-task fine-tuning were carried out for these tasks.
   - All possible task-order combinations were evaluated by 10-fold cross-validation, with a total of 60 fine-tuning-chain models.
2. **Data Analysis**:
   - **Dataset Feature Analysis**: Covers syntactic similarity, semantic similarity and dataset size.
     - **Syntactic Similarity Analysis**: The syntactic similarity between datasets was calculated by comparing keywords, operators and identifiers in code fragments.
     - **Semantic Similarity Analysis**: Vector representations of code fragments were generated by CodeBERT, and the semantic similarity was calculated by cosine similarity.
     - **Dataset Size Analysis**: The impact of dataset size on target-task performance was explored.
   - **Task Feature Analysis**: The knowledge transfer effect between tasks was measured by Task Affinity Analysis.
3. **Results and Discussion**:
   - The experimental results show that task order does affect target-task performance, with gains of up to 6% and losses of up to 4%.
   - The paper further analyzes the factors that may cause this impact, including dataset features, task features and model characteristics.

### Conclusions

The main contributions of this paper are:

1. It is the first systematic study of how the fine-tuning order of multiple intermediate tasks affects the performance of software engineering target tasks.
2. Through multi-dimensional analysis, it explains why task order has an impact on performance.
3. It provides practical suggestions for choosing the most cost-effective task order to achieve the best performance gain.

These findings give software engineering researchers and practitioners a reference for understanding how to improve model performance by choosing a suitable fine-tuning task order.
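
The experimental design above evaluates every ordering of intermediate tasks for each target task. The summary does not say how the figure of 60 fine-tuning chains is obtained. The sketch below shows one enumeration that happens to produce the same count, under the assumption that each of the four tasks serves as the target in turn and that the intermediate chain is any ordered selection of one, two or three of the remaining tasks. The function name `fine_tuning_chains` is hypothetical and not taken from the paper.

```python
from itertools import permutations

# Task abbreviations as used in the summary above.
TASKS = ["CD", "DD", "CR", "CT"]


def fine_tuning_chains(tasks):
    """Return (intermediate_order, target) pairs for every target task.

    Assumption: an intermediate chain is an ordered selection of 1..n-1
    of the non-target tasks; the paper's exact protocol may differ.
    """
    chains = []
    for target in tasks:
        others = [t for t in tasks if t != target]
        for length in range(1, len(others) + 1):
            for order in permutations(others, length):
                chains.append((order, target))
    return chains


chains = fine_tuning_chains(TASKS)
print(len(chains))  # 4 targets * (3 + 6 + 6) orderings = 60

# Show a few chains in "intermediate -> ... -> target" form.
for order, target in chains[:3]:
    print(" -> ".join(order + (target,)))
```

Under this reading, each target has 3 single-task chains, 6 two-task chains and 6 three-task chains, which is why the number of models to train grows quickly as tasks are added.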

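The semantic similarity analysis described above compares CodeBERT vector representations with cosine similarity. The following is a minimal sketch of that comparison using toy 3-dimensional vectors in place of real CodeBERT embeddings. Averaging the per-snippet vectors into one vector per dataset is an assumption made here for illustration; the summary does not state how dataset-level similarity is aggregated. The helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dataset_similarity(embeddings_a, embeddings_b):
    """Assumed aggregation: cosine similarity of the two dataset means."""
    return cosine_similarity(mean_vector(embeddings_a), mean_vector(embeddings_b))


# Placeholder embeddings; real ones would come from CodeBERT.
dataset_a = [[1.0, 0.0, 0.0], [1.0, 2.0, 0.0]]
dataset_b = [[2.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
print(dataset_similarity(dataset_a, dataset_b))
```

A value near 1 would indicate that the two datasets occupy similar regions of the embedding space, and a value near 0 would indicate little overlap.
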
Does the Order of Fine-tuning Matter and Why?
