CAREER: Context-Aware API Recognition with Data Augmentation for API Knowledge Extraction

Zhang, Xinjun Mao, Shangwen Wang, Kang Yang, Yao Lu
DOI: https://doi.org/10.1145/3643916.3644431
2024-01-01
Abstract: Recognizing Application Programming Interface (API) mentions in software-related texts is a prerequisite for extracting API-related knowledge. Previous studies have demonstrated the superiority of deep learning-based methods for this task. However, such techniques still hit bottlenecks because they cannot effectively handle three challenges: (1) differentiating APIs from common words; (2) identifying APIs that appear as morphological variants of the standard APIs; and (3) the lack of high-quality labeled data for training. To overcome these challenges, this paper proposes a context-aware API recognition method named CAREER. The approach uses two key components, Bidirectional Encoder Representations from Transformers (BERT) and Bi-directional Long Short-Term Memory (BiLSTM), to extract context information at both the word level and the sequence level. This combination lets the method dynamically capture both syntactic and semantic information, addressing the first challenge. To tackle the second challenge, CAREER introduces a character-level BiLSTM component enriched with an attention mechanism. This enables the model to grasp character-level global context information, thereby enhancing the recognition of morphological attributes within API mentions. To address the third challenge, the paper introduces three data augmentation techniques for generating new data samples, accompanied by a novel sample selection algorithm designed to select high-quality instances. This dual-pronged approach reduces the need for data labeling. Experiments demonstrate that CAREER significantly improves F1-score by 11.0% compared with state-of-the-art methods. We also construct specific datasets to assess CAREER's capacity to tackle the aforementioned challenges. Results confirm that (1) CAREER significantly outperforms baseline methods in addressing the first and second challenges, and (2) with the aid of the data augmentation techniques and the sample selection algorithm, high-quality samples can be generated to improve performance and alleviate the third challenge.
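The character-level attention step mentioned in the abstract can be pictured as a weighted pooling over per-character hidden states. The abstract does not specify the form of attention CAREER uses, so the following is only a minimal sketch under assumptions: dot-product scoring against a single query vector, toy hand-written vectors standing in for the character-level BiLSTM outputs, and hypothetical names (`softmax`, `attention_pool`, `char_states`, `query`) that do not come from the paper.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    states: List[List[float]], query: List[float]
) -> Tuple[List[float], List[float]]:
    """Pool per-character vectors into one vector using attention weights.

    Assumption: dot-product scoring against one query vector. In a trained
    model the query would be learned; here it is supplied by the caller.
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in states]
    weights = softmax(scores)
    dim = len(states[0])
    pooled = [
        sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)
    ]
    return pooled, weights


# Toy stand-ins for the hidden states of three characters (not real BiLSTM output).
char_states = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4]]
query = [1.0, 0.0]  # hypothetical query vector
word_vector, char_weights = attention_pool(char_states, query)
```

Characters whose hidden state aligns more closely with the query receive larger weights, so the pooled vector summarizes the whole character sequence rather than only its final position. This is one plausible reading of "character-level global context information", not a statement of the paper's exact design.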