Document Distance Estimation via Code Graph Embedding

Zeqi Lin, Junfeng Zhao, Yanzhen Zou, Bing Xie
DOI: https://doi.org/10.1145/3131704.3131713
2017-01-01
Abstract: Accurately representing the distance between two documents (i.e., pieces of textual information extracted from various software artifacts) has far-reaching applications in many automated software engineering approaches, such as concept location, bug location, and traceability link recovery. This is a challenging task, since documents containing different words may have similar semantic meanings. In this paper, we propose a novel document distance estimation approach. The approach captures latent semantic associations between documents by analyzing structural information in software source code. First, we embed code elements as points in a shared representation space according to the structural dependencies between them. Then, we represent each document as a weighted point cloud of code elements in that space and reduce the distance between two documents to an earth mover's distance transportation problem. We define a document classification task on a StackOverflow dataset to evaluate the effectiveness of our approach. The empirical evaluation results show that our approach outperforms several state-of-the-art approaches.
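To make the second step concrete, the following is a minimal sketch of an earth mover's distance between two weighted point clouds, written as the standard transportation linear program and solved with SciPy's `linprog`. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the embedding is learned, how element weights are assigned, or which solver is used. The names `earth_movers_distance` and `document_distance`, the Euclidean ground cost, and the toy 2-D coordinates are all hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def earth_movers_distance(points_a, weights_a, points_b, weights_b):
    """EMD between two weighted point clouds, solved as a linear program.

    Assumption: the ground cost is the Euclidean distance between points,
    and each cloud's weights are normalized to sum to 1.
    """
    points_a = np.asarray(points_a, dtype=float)
    points_b = np.asarray(points_b, dtype=float)
    w_a = np.asarray(weights_a, dtype=float)
    w_b = np.asarray(weights_b, dtype=float)
    w_a = w_a / w_a.sum()
    w_b = w_b / w_b.sum()
    n, m = len(w_a), len(w_b)

    # cost[i, j] = distance between point i of cloud A and point j of cloud B.
    cost = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=2)

    # Flow variables f[i, j] >= 0, flattened in row-major order.
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0  # point i of A ships out exactly w_a[i]
    for j in range(m):
        a_eq[n + j, j::m] = 1.0           # point j of B receives exactly w_b[j]
    b_eq = np.concatenate([w_a, w_b])

    result = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                     bounds=(0, None), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return float(result.fun)


def document_distance(doc_a, doc_b, embedding):
    """Distance between two documents given as {code_element: weight} dicts."""
    names_a, names_b = list(doc_a), list(doc_b)
    return earth_movers_distance(
        [embedding[name] for name in names_a], [doc_a[name] for name in names_a],
        [embedding[name] for name in names_b], [doc_b[name] for name in names_b],
    )


if __name__ == "__main__":
    # Hypothetical 2-D embedding of three code elements (made-up coordinates).
    embedding = {
        "IndexWriter": [0.0, 0.0],
        "IndexReader": [1.0, 0.0],
        "QueryParser": [0.0, 2.0],
    }
    doc_a = {"IndexWriter": 1.0}
    doc_b = {"IndexReader": 0.5, "QueryParser": 0.5}
    print(document_distance(doc_a, doc_b, embedding))
```

In this toy case all of the weight in the first document must be split evenly between the two elements of the second, so the distance is 0.5 × 1 + 0.5 × 2 = 1.5. Two documents that share no words can still be close under this measure if their code elements lie near each other in the embedding space, which is the property the abstract relies on.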