Improving Fine-tuning Pre-trained Models on Small Source Code Datasets Via Variational Information Bottleneck.

Jiaxing Liu, Chaofeng Sha, Xin Peng
DOI: https://doi.org/10.1109/saner56733.2023.00039
2023-01-01
Abstract: Small datasets are common in software engineering tasks such as linguistic smell detection and code runtime complexity prediction, as crafting these datasets often requires expert knowledge. Prior work usually tackles them by applying machine learning algorithms (e.g., logistic regression and SVM) with hand-crafted features, which can outperform neural models such as CNNs. Recently, researchers have fine-tuned large pre-trained code models on various code-related tasks, thanks to their transferability. However, fine-tuning on small datasets can still be unstable and prone to overfitting. In this paper, we first conduct an empirical study in which we fine-tune CodeBERT(a) on four small code-related datasets and observe this instability. It could be induced by the over-capacity of these large pre-trained code models and the irrelevant features they carry with respect to such small datasets. To address this issue, we leverage the variational information bottleneck to filter out irrelevant features when fine-tuning the models. The experiments demonstrate that our method outperforms standard fine-tuning and regularization methods such as dropout and weight decay. We also experimentally study the stability of our method by varying dataset sizes. Our code and data are available at https://github.com/little-pikachu-hash/VIBCodeBERT.
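To make the variational information bottleneck idea concrete, the following is a minimal sketch of a generic VIB training loss, not the authors' implementation. It assumes the encoder output has already been mapped to a mean `mu` and log-variance `logvar` of a diagonal Gaussian, and that the prior is a standard normal (the usual choice in the VIB literature, assumed here). The function names, the `beta` weight, and the toy `classifier` are hypothetical.

```python
import math
import random

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a single example."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps, with eps drawn from N(0, 1) per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def cross_entropy(logits, label):
    """Negative log-likelihood of `label` under softmax(logits)."""
    top = max(logits)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_norm - logits[label]

def vib_loss(mu, logvar, classifier, label, beta, rng):
    """Task loss on a sampled bottleneck code plus a beta-weighted KL penalty.

    The KL term pushes the code z toward the prior, which discourages z from
    retaining information about the input that does not help predict the label.
    """
    z = reparameterize(mu, logvar, rng)
    logits = classifier(z)
    return cross_entropy(logits, label) + beta * kl_to_standard_normal(mu, logvar)

# Toy usage: a hypothetical two-class linear classifier over a 3-dim code.
rng = random.Random(0)
toy_classifier = lambda z: [sum(z), -sum(z)]
loss = vib_loss([0.2, -0.1, 0.4], [-1.0, -1.0, -1.0], toy_classifier, 0, 1e-3, rng)
```

In this sketch, a larger `beta` compresses the representation more aggressively, while `beta = 0` reduces the loss to plain cross-entropy on a noisy code.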