More Insight from Being More Focused: Analysis of Clustered Market Apps

Maleknaz Nayebi, Homayoon Farrahi, Ada Lee, Henry Cho, Guenther Ruhe
2024-05-25
Abstract: The increasing attraction of mobile apps has inspired researchers to analyze apps from different perspectives. As with any software product, apps have different attributes such as size, content maturity, rating, category, or number of downloads. Current research studies mostly consider sampling across all apps. This often results in comparisons of apps being quite different in nature and category (games compared with weather and calendar apps), also being different in size and complexity. Similar to proprietary software and web-based services, more specific results can be expected from looking at more homogeneous samples as they can be received as a result of applying clustering. In this paper, we target homogeneous samples of apps to increase the degree of insight gained from analytics. As a proof-of-concept, we applied the clustering technique DBSCAN and subsequent correlation analysis between app attributes for a set of 940 open-source mobile apps from F-Droid. We showed that (i) clusters of apps with similar characteristics provided more insight compared to applying the same to the whole data and (ii) defining the similarity of apps based on the similarity of topics as created from the topic modeling technique Latent Dirichlet Allocation does not significantly improve clustering results.
Software Engineering
### What problems does this paper attempt to solve?

This paper aims to improve the understanding of the market performance of mobile applications (apps) through cluster analysis. Specifically, it addresses the following key problems:

1. **Improve the accuracy of data analysis**:
   - Most current research samples across all applications, which often leads to comparisons between applications that differ widely in nature and category (for example, game apps vs. weather or calendar apps). Such heterogeneous samples may lead to ambiguous conclusions.
   - The paper proposes using clustering techniques to divide applications into more homogeneous subsets, so that the accuracy and insight of data analysis can be improved.
2. **Evaluate the impact of clustering on correlation analysis**:
   - **Research Question 1 (RQ1)**: For correlation analysis, can clustering the original data improve the results?
   - The authors apply the DBSCAN clustering technique, followed by correlation analysis, to check whether clustering improves the correlation results between market attributes.
3. **Evaluate the impact of a topic-based similarity definition on clustering effectiveness**:
   - **Research Question 2 (RQ2)**: When defining similarity, can including topics extracted from application descriptions significantly improve clustering results (for correlation analysis)?
   - The authors use Latent Dirichlet Allocation (LDA) to extract topics from application descriptions and use them as one of the clustering attributes to evaluate their impact on clustering effectiveness.

### Overview of research methods

- **Data source**: 940 open-source mobile applications from F-Droid.
- **Clustering technique**: The DBSCAN algorithm.
- **Attribute selection**: Market attributes (such as ratings and number of downloads) and development attributes (such as number of code commits and number of contributors).
- **Topic modeling**: LDA is used to extract topics from application descriptions, which serve as additional attributes for clustering.
- **Evaluation metrics**: The effectiveness of clustering is evaluated by calculating the change in correlation between attributes before and after clustering, and by measuring the variance improvement.

### Main findings

- **Effect of market-attribute clustering**: Market attributes show stronger correlations after clustering, especially when clustering is based on ratings, number of five-star ratings, and number of one-star ratings.
- **Impact of topic modeling**: The similarity definition based on topic modeling does not significantly improve clustering results, indicating that relying only on topics in text descriptions may not be sufficient to improve the quality of clustering.

Through these studies, the paper shows how more focused samples (i.e., homogeneous subsets of applications) can yield more valuable market-analysis insights.
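
The cluster-then-correlate idea summarized above can be illustrated with a small, dependency-free sketch. It is not the authors' pipeline: the data is synthetic, the attribute names (`rating`, `five_star_share`, `commits`, `downloads`), the `eps` and `min_pts` values, and the helper functions are all assumptions made for illustration, and the DBSCAN below is a plain quadratic-time version rather than a tuned library implementation (in practice one would more likely use something like scikit-learn's `DBSCAN`). The sketch clusters apps on two rating-related attributes, then compares the Pearson correlation between two other attributes on the pooled data and within each cluster.

```python
import math
import random
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def standardize(column):
    """Rescale a column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    # Each point counts as its own neighbor (distance 0).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


def make_synthetic_apps(seed=0):
    """Two hypothetical app groups whose commit/download relation differs."""
    rng = random.Random(seed)
    apps = []
    for rating_mu, share_mu, intercept, slope in [
        (4.5, 0.8, 2000, 50),
        (2.5, 0.2, 12500, -50),
    ]:
        for _ in range(40):
            commits = rng.uniform(10, 200)
            apps.append({
                "rating": rng.gauss(rating_mu, 0.1),
                "five_star_share": rng.gauss(share_mu, 0.03),
                "commits": commits,
                "downloads": intercept + slope * commits + rng.gauss(0, 500),
            })
    return apps


apps = make_synthetic_apps()

# Cluster on standardized rating attributes only.
features = list(zip(
    standardize([a["rating"] for a in apps]),
    standardize([a["five_star_share"] for a in apps]),
))
labels = dbscan(features, eps=0.5, min_pts=5)

# Correlation on the whole sample versus within each cluster (noise excluded).
pooled_r = pearson([a["commits"] for a in apps], [a["downloads"] for a in apps])
groups = defaultdict(list)
for app, label in zip(apps, labels):
    if label != -1:
        groups[label].append(app)
cluster_r = {
    label: pearson([a["commits"] for a in members],
                   [a["downloads"] for a in members])
    for label, members in groups.items()
}

print(f"pooled r = {pooled_r:.2f}")
for label, r in sorted(cluster_r.items()):
    print(f"cluster {label}: n = {len(groups[label])}, r = {r:.2f}")
```

Because the two synthetic groups were built with opposite commit/download trends, the pooled correlation is weak while each cluster shows a strong one. This is a deliberately extreme construction meant to show why heterogeneous samples can hide relationships; it makes no claim about the effect sizes reported in the paper.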