Multi-View Adaptive Contrastive Learning for Information Retrieval Based Fault Localization

Chunying Zhou, Xiaoyuan Xie, Gong Chen, Peng He, Bing Li
2024-09-19
Abstract: Most studies focused on information retrieval-based techniques for fault localization, which built representations for bug reports and source code files and matched their semantic vectors through similarity measurement. However, such approaches often ignore some useful information that might help improve localization performance, such as 1) the interaction relationship between bug reports and source code files; 2) the similarity relationship between bug reports; and 3) the co-citation relationship between source code files. In this paper, we propose a novel approach named Multi-View Adaptive Contrastive Learning for Information Retrieval Fault Localization (MACL-IRFL) to learn the above-mentioned relationships for software fault localization. Specifically, we first generate data augmentations from report-code interaction view, report-report similarity view and code-code co-citation view separately, and adopt graph neural network to aggregate the information of bug reports or source code files from the three views in the embedding process. Moreover, we perform contrastive learning across these views. Our design of contrastive learning task will force the bug report representations to encode information shared by report-report and report-code views, and the source code file representations shared by code-code and report-code views, thereby alleviating the noise from auxiliary information. Finally, to evaluate the performance of our approach, we conduct extensive experiments on five open-source Java projects. The results show that our model can improve over the best baseline up to 28.93%, 25.57% and 20.35% on Accuracy@1, MAP and MRR, respectively.
Software Engineering, Information Retrieval
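The abstract reports results on Accuracy@1, MAP and MRR. As a rough illustration of how these ranking metrics are commonly computed for fault localization, here is a minimal sketch. It assumes the usual textbook definitions, which may differ in detail from the paper's evaluation scripts. The function names, the `rankings`/`relevant` data layout, and the sample file names are all hypothetical.

```python
from typing import Dict, List, Set


def accuracy_at_k(rankings: Dict[str, List[str]],
                  relevant: Dict[str, Set[str]], k: int = 1) -> float:
    """Fraction of bug reports with at least one buggy file in the top k."""
    hits = sum(
        1 for report, ranked in rankings.items()
        if any(f in relevant[report] for f in ranked[:k])
    )
    return hits / len(rankings)


def mean_reciprocal_rank(rankings: Dict[str, List[str]],
                         relevant: Dict[str, Set[str]]) -> float:
    """Average of 1/rank of the first buggy file (0 if none is ranked)."""
    total = 0.0
    for report, ranked in rankings.items():
        for pos, f in enumerate(ranked, start=1):
            if f in relevant[report]:
                total += 1.0 / pos
                break
    return total / len(rankings)


def mean_average_precision(rankings: Dict[str, List[str]],
                           relevant: Dict[str, Set[str]]) -> float:
    """Mean over bug reports of the average precision of each ranking."""
    total = 0.0
    for report, ranked in rankings.items():
        hits, precision_sum = 0, 0.0
        for pos, f in enumerate(ranked, start=1):
            if f in relevant[report]:
                hits += 1
                precision_sum += hits / pos  # precision at this hit
        if relevant[report]:
            total += precision_sum / len(relevant[report])
    return total / len(rankings)


# Hypothetical data: ranked file lists per bug report, and the files
# that were actually changed to fix each bug.
rankings = {
    "bug-1": ["A.java", "B.java", "C.java"],
    "bug-2": ["B.java", "C.java", "A.java"],
}
relevant = {"bug-1": {"A.java", "C.java"}, "bug-2": {"C.java"}}

print(accuracy_at_k(rankings, relevant, k=1))
print(mean_reciprocal_rank(rankings, relevant))
print(mean_average_precision(rankings, relevant))
```

In this toy example only `bug-1` has a buggy file at rank 1, so Accuracy@1 is 0.5, while MRR and MAP also credit the buggy files found further down each list.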
What problem does this paper attempt to address?
This paper addresses key problems in software fault localization, in particular the limitations of information retrieval-based (IR-based) techniques. Specifically, the authors point out the following deficiencies in existing methods:

1. **Ignoring useful auxiliary information**: Existing IR-based methods build representations of bug reports and source code files and match their semantic vectors, but they often ignore information that could improve localization performance, such as:
   - The interaction relationship between bug reports and source code files.
   - The similarity relationship between bug reports.
   - The co-citation relationship between source code files.
2. **Impact of text quality**: The performance of IR-based methods usually depends on the text quality of bug reports. When a bug report provides an insufficient textual description, even very complex models struggle to achieve satisfactory performance.

To solve these problems, the authors propose a new method called Multi-View Adaptive Contrastive Learning for Information Retrieval Fault Localization (MACL-IRFL). It aims to improve fault localization in the following ways:

- **Constructing a multi-view structure**: Generate data augmentations from three views (the report-code interaction view, the report-report similarity view, and the code-code co-citation view), and use a graph neural network (GNN) to aggregate the information of bug reports or source code files from these views during the embedding process.
- **Contrastive learning**: Perform contrastive learning across these views. The contrastive task forces bug report representations to encode the information shared by the report-report and report-code views, and source code file representations to encode the information shared by the code-code and report-code views, thereby reducing the impact of noise in the auxiliary information.

With this design, MACL-IRFL can use historical fix records and other auxiliary information at prediction time to compensate for the missing fix history of new bug reports, while suppressing the noise introduced by excess auxiliary information. Experiments on five open-source Java projects show that the method improves over the best baseline by up to 28.93%, 25.57% and 20.35% on Accuracy@1, MAP and MRR, respectively.
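To make the cross-view contrastive idea more concrete, the sketch below computes an InfoNCE-style loss between two sets of embeddings of the same nodes, for example bug reports embedded from the report-code view and from the report-report view. The text above does not state the exact loss used by MACL-IRFL, so InfoNCE with cosine similarity is an assumption here. The function names and the temperature value are hypothetical, and the plain-Python vectors stand in for GNN outputs.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def info_nce(view_a: List[Sequence[float]],
             view_b: List[Sequence[float]],
             temperature: float = 0.2) -> float:
    """Average InfoNCE loss between two views of the same nodes.

    view_a[i] and view_b[i] are embeddings of the same node (positive
    pair); every other row of view_b acts as a negative for view_a[i].
    The temperature default is an illustrative assumption.
    """
    if not view_a or len(view_a) != len(view_b):
        raise ValueError("views must be non-empty and the same length")
    loss = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # -log( exp(positive) / sum(exp(all)) )
        loss += log_denominator - logits[i]
    return loss / len(view_a)


# Hypothetical embeddings of two bug reports under two views.
report_code_view = [[1.0, 0.0], [0.0, 1.0]]
report_report_view = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce(report_code_view, report_report_view)
misaligned = info_nce(report_code_view, report_report_view[::-1])
print(aligned < misaligned)
```

Minimizing such a loss pulls the two views of the same bug report together and pushes different bug reports apart, which is one common way to keep only the information the views share. The same function could be applied to source code file embeddings from the code-code and report-code views.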