AVIATE: Exploiting Translation Variants of Artifacts to Improve IR-based Traceability Recovery in Bilingual Software Projects

Kexin Sun, Yiding Ren, Hongyu Kuang, Hui Gao, Xiaoxing Ma, Guoping Rong, Dong Shao, He Zhang
2024-09-28
Abstract: Traceability plays a vital role in facilitating various software development activities by establishing the traces between different types of artifacts (e.g., issues and commits in software repositories). Among the explorations for automated traceability recovery, the IR (Information Retrieval)-based approaches leverage textual similarity to measure the likelihood of traces between artifacts and show advantages in many scenarios. However, the globalization of software development has introduced new challenges, such as the possible multilingualism on the same concept (e.g., "ShuXing" vs. "attribute") in the artifact texts, thus significantly hampering the performance of IR-based approaches. Existing research has shown that machine translation can help address the term inconsistency in bilingual projects. However, the translation can also bring in synonymous terms that are not consistent with those in the bilingual projects (e.g., another translation of "ShuXing" as "property"). Therefore, we propose an enhancement strategy called AVIATE that exploits translation variants from different translators by utilizing the word pairs that appear simultaneously across the translation variants from different kinds of artifacts (a.k.a. consensual biterms). We use these biterms to first enrich the artifact texts, and then to enhance the calculated IR values for improving IR-based traceability recovery for bilingual software projects. The experiments on 17 bilingual projects (involving English and 4 other languages) demonstrate that AVIATE significantly outperformed the IR-based approach with machine translation (the state-of-the-art in this field) with an average increase of 16.67 (31.43%) in Average Precision and 8.38 (11.22%) in Mean Average Precision, indicating its effectiveness in addressing the challenges of multilingual traceability recovery.
Software Engineering
What problem does this paper attempt to address?
The problem this paper addresses is multilingual term inconsistency, which hampers information retrieval (IR) techniques when recovering traceability in bilingual software projects. Specifically:

1. **Challenges Brought by the Multilingual Development Environment**: As software development globalizes, projects increasingly involve multiple languages. Developers may mix languages when writing code, submitting issues, or writing comments, so the same concept ends up expressed inconsistently (for example, "属性" has two English translations: "attribute" and "property"). This term inconsistency seriously degrades IR methods based on text similarity.
2. **Limitations of Existing Solutions**: Machine translation resolves part of the term inconsistency, but different translation tools may produce different results, and these results may not match the terms actually used in the project. Existing machine-translation-based methods therefore still face performance bottlenecks.

To solve these problems, the authors propose a method named AVIATE. It uses translation variants generated by multiple translation tools to capture term pairs that are consistent across the variants (i.e., "consensual biterms"), which improves the ability of IR models to recover traceability in bilingual projects. The specific steps are:

- **Pre-process Data**: Clean and standardize the input text.
- **Translation by Multiple Translation Tools**: Use four mainstream translation tools (NLLB-1.3B, M2M-100-12B, Google Translate, Tencent Translate) to translate sentences that are not purely English into pure English.
- **Extract Consensual Biterms**: From the multiple translation variants, extract biterms (word pairs) that appear simultaneously in different types of artifacts (such as issues and commits).
- **Select Distinctive Consensual Biterms**: Select the most representative consensual biterms by calculating the Inverse Translation Variant Frequency (ITVF).
- **Adjust Weight Factors**: Adjust how many times each consensual biterm is repeated in the text according to its distinctiveness, to highlight key information.

Through these steps, AVIATE alleviates the impact of translation inconsistency and significantly improves IR-based traceability recovery in bilingual projects. The experimental results show that, compared with the IR-based approach with machine translation, AVIATE improves Average Precision (AP) and Mean Average Precision (MAP) by 31.43% and 11.22% on average, respectively, indicating its effectiveness.
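
The extraction, selection, and enrichment steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: a biterm is simplified to any unordered pair of distinct words within one text; the ITVF formula is not given in this summary, so `itvf` uses an IDF-style stand-in; and all function names (`biterms`, `consensual_biterms`, `itvf`, `enrich`) and the sample sentences are hypothetical.

```python
import math
import re
from itertools import combinations


def biterms(text):
    """Unordered pairs of distinct words in one text (simplified biterm notion)."""
    tokens = sorted(set(re.findall(r"[a-z]+", text.lower())))
    return set(combinations(tokens, 2))


def consensual_biterms(issue_variants, commit_variants):
    """Biterms found in translation variants of BOTH artifact types."""
    issue_bts = set().union(*(biterms(v) for v in issue_variants))
    commit_bts = set().union(*(biterms(v) for v in commit_variants))
    return issue_bts & commit_bts


def itvf(biterm, all_variants):
    """ASSUMED IDF-style score: log(total variants / variants containing the biterm)."""
    containing = sum(1 for v in all_variants if biterm in biterms(v))
    return math.log(len(all_variants) / containing) if containing else 0.0


def enrich(text, selected, repeat=1):
    """Append each selected biterm to the text `repeat` times."""
    extra = [" ".join(bt) for bt in sorted(selected)] * repeat
    return " ".join([text] + extra)


# Hypothetical translation variants of one issue and one commit message.
issue_variants = ["modify attribute of user", "change property of user"]
commit_variants = ["fix user attribute bug", "repair user attribute error"]

shared = consensual_biterms(issue_variants, commit_variants)
print(shared)  # {('attribute', 'user')}
print(itvf(("attribute", "user"), issue_variants + commit_variants))
print(enrich("change property of user", shared, repeat=2))
```

In this toy example one translator rendered the issue with "property", yet the pair ("attribute", "user") still appears in variants of both the issue and the commit. Appending it to the issue text restores a term overlap that a single translation would have lost, which is the intuition behind enriching artifacts with consensual biterms.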