Improving Source Code Pre-training Via Type-Specific Masking
Wentao Zou, Qi Li, Chuanyi Li, Jidong Ge, Xiang Chen, LiGuo Huang, Bin Luo
DOI: https://doi.org/10.1145/3699599
IF: 3.685
2024-01-01
ACM Transactions on Software Engineering and Methodology
Abstract: The masked language modeling (MLM) task is widely recognized as one of the most effective pre-training tasks and has given rise to many variants in the software engineering (SE) field. However, most of these variants focus on code representation without distinguishing between code token types, while a few focus on a single type, such as code identifiers. In fact, code contains various token types, and there is no evidence that identifiers are the only type that can improve pre-trained models (PTMs). Thus, to improve PTMs through different token types, we conducted an extensive study to evaluate how different type-specific masking tasks affect PTMs. First, we extract five code token types, convert them into type-specific masking tasks, and generate their combinations. Second, we pre-train CodeBERT and PLBART using these combinations and fine-tune them on four SE downstream tasks. Experimental results show that type-specific masking tasks can enhance CodeBERT and PLBART on all downstream tasks. Furthermore, we discuss low-resource datasets, conflicting PTMs (those whose original pre-training tasks conflict with our methods), the cost and performance of our methods, factors that impact their performance, and the application of our methods to state-of-the-art PTMs. These discussions comprehensively analyze the strengths and weaknesses of different type-specific masking tasks.
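To make the idea of a type-specific masking task concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not name the five token types, so the labels used here (`identifier`, `keyword`, `literal`, `operator`, `comment`) are illustrative assumptions, as are the helper names `token_type`, `typed_tokens`, and `mask_by_type`. The sketch relies on Python's standard `tokenize` module instead of whatever extraction tooling the paper uses, and it replaces selected tokens with a single `<mask>` placeholder without any further replacement strategy.

```python
import io
import keyword
import random
import tokenize

MASK = "<mask>"  # placeholder string; a real tokenizer would supply its own mask token


def token_type(tok):
    """Map a Python token to a coarse type label (labels are assumptions)."""
    if tok.type == tokenize.NAME:
        return "keyword" if keyword.iskeyword(tok.string) else "identifier"
    if tok.type in (tokenize.NUMBER, tokenize.STRING):
        return "literal"
    if tok.type == tokenize.OP:
        return "operator"  # tokenize also reports punctuation such as "(" as OP
    if tok.type == tokenize.COMMENT:
        return "comment"
    return None  # layout tokens (NEWLINE, INDENT, ENDMARKER, ...) are skipped


def typed_tokens(source):
    """Return (text, type_label) pairs for every labelled token in source."""
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        label = token_type(tok)
        if label is not None:
            pairs.append((tok.string, label))
    return pairs


def mask_by_type(source, target_type, mask_prob=1.0, rng=None):
    """Mask only tokens of target_type.

    Returns (masked_tokens, labels), where labels holds the original text at
    masked positions and None elsewhere, i.e. the prediction targets.
    """
    rng = rng or random.Random(0)
    masked, labels = [], []
    for text, label in typed_tokens(source):
        if label == target_type and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(text)
        else:
            masked.append(text)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    tokens, targets = mask_by_type("x = y + 1\n", "identifier")
    print(tokens)   # ['<mask>', '=', '<mask>', '+', '1']
    print(targets)  # ['x', None, 'y', None, None]
```

Calling `mask_by_type` with a different `target_type` yields a different masking task over the same code, and combinations of tasks could be formed by applying several target types to the same corpus. How the paper actually combines and weights the tasks is not stated in the abstract.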