Measuring and Predicting the Relevance Ratings Between FLOSS Projects Using Topic Features

Zhiwen Zheng,Liang Wang,Jingwei Xu,Tianheng Wu,Simeng Wu,Xianping Tao
DOI: https://doi.org/10.1145/3275219.3275222
2018-01-01
Abstract:Understanding the relevance between the Free/Libra Open Source Software projects is important for developers to perform code and design reuse, discover and develop new features, keep their projects up-to-date, and etc. However, it is challenging to perform relevance ratings between the FLOSS projects mainly because: 1) beyond simple code similarity, there are complex aspects considered when measuring the relevance; and 2) the prohibitive large amount of FLOSS projects available. To address the problem, in this paper, we propose a method to measure and further predict the relevance ratings between FLOSS projects. Our method uses topic features extracted by the LDA topic model to describe the characteristics of a project. By using the topic features, multiple aspects of FLOSS projects such as the application domain, technology used, and programming language are extracted and further used to measure and predict their relevance ratings. Based on the topic features, our method uses matrix factorization to leverage the partially known relevance ratings between the projects to learn the mapping between different topic features to the relevance ratings. Finally, our method combines the topic modeling and matrix factorization technologies to predict the relevance ratings between software projects without human intervention, which is scalable to a large amount of projects. We evaluate the performance of the proposed method by applying our topic extraction and relevance modeling methods using 300 projects from GitHub. The result of topic extraction experiment shows that, for topic modeling, our LDA-based approach achieves the highest hit rate of 98.3% and the highest average accuracy of 29.8%. And the relevance modeling experiment shows that our relevance modeling approach achieves the minimum average predict error of 0.093, suggesting the effectiveness of applying the proposed method on real-world data sets.
What problem does this paper attempt to address?