Applying Naive Bayes Classification to Google Play Apps Categorization

Babatunde Olabenjo
DOI: https://doi.org/10.48550/arXiv.1608.08574
2016-08-31
Abstract: There are over one million apps on Google Play Store and over half a million publishers. Having such a huge number of apps and developers can pose a challenge to app users and new publishers on the store. Discovering apps can be challenging if apps are not published in the right category, which in turn reduces earnings for app developers. Additionally, with over 41 categories on Google Play Store, deciding on the right category in which to publish an app can be challenging for developers. Machine Learning has been very useful, especially in classification problems such as sentiment analysis, document classification and spam detection. These strategies can also be applied to app categorization on Google Play Store to suggest appropriate categories for app publishers using details from their application. In this project, we built two variations of the Naive Bayes classifier using open metadata from top developer apps on Google Play Store in order to classify new apps on the store. These classifiers are then evaluated using various evaluation methods and their results compared against each other. The results show that the Naive Bayes algorithm performs well for our classification problem and can potentially automate app categorization for Android app publishers on Google Play Store.
Machine Learning, Information Retrieval
What problem does this paper attempt to address?
The main problem that this paper attempts to solve is: how to use the Naive Bayes classification method in machine learning to accurately classify applications in the Google Play store, so as to help developers choose the correct category to publish their applications. This can not only improve the user experience of finding applications, but also increase the download volume and revenue of developers' applications. Specifically, the paper solves the following problems:

1. **Application classification problem**:
   - There are more than 1 million applications and more than 500,000 publishers in the Google Play store. Due to the large number of applications, developers face challenges in choosing the correct category.
   - If the applications are not correctly classified, it may be difficult for users to find these applications, thereby reducing the download volume of the applications and the revenue of the developers.
2. **Automated classification suggestions**:
   - The paper proposes to use the Naive Bayes classifier to provide automated classification suggestions for new applications based on the application data provided by existing successful developers.
   - By extracting the metadata of the application (such as application name, content rating, whether it is free, whether there are in-app purchases, description, etc.), and using the TF-IDF (term frequency-inverse document frequency) statistical method, a classification model is constructed.
3. **Improving classification accuracy**:
   - The paper compares the performance of two Naive Bayes classifiers (Multinomial Naive Bayes and Bernoulli Naive Bayes), and verifies their effectiveness through multiple evaluation methods (such as k-fold cross-validation, confusion matrix, F1-score, etc.).
   - The experimental results show that the Multinomial Naive Bayes classifier performs better when dealing with all categories, and the classification accuracy is further improved after merging game-type applications into one large category.

### Formula display

The formulas involved in the paper are as follows:

1. **Bayes' theorem**:
   \[ P(A|B)=\frac{P(B|A)P(A)}{P(B)} \]
   where:
   - \(P(A|B)\) is the posterior probability, which represents the probability that event \(A\) occurs given that event \(B\) has occurred.
   - \(P(B|A)\) is the likelihood, which represents the probability that event \(B\) occurs given that event \(A\) has occurred.
   - \(P(A)\) is the prior probability, which represents the probability that event \(A\) occurs.
   - \(P(B)\) is the marginal probability, which represents the probability that event \(B\) occurs.
2. **Maximum a posteriori estimation (MAP)**:
   \[ c_{\text{MAP}}=\operatorname*{arg\,max}_{c\in C} P(c|d) \]
   \[ c_{\text{MAP}}=\operatorname*{arg\,max}_{c\in C}\Big(P(c)\prod_{1\leq k\leq n_d}P(t_k|c)\Big) \]
3. **Log-likelihood estimation**:
   \[ c_{\text{MAP}}=\operatorname*{arg\,max}_{c\in C}\Big(\log P(c)+\sum_{1\leq k\leq n_d}\log P(t_k|c)\Big) \]
4. **Laplace smoothing**:
   \[ P(t|c)=\frac{T_{ct}+1}{\sum_{t'\in V}(T_{ct'}+1)} \]
5. **TF-IDF calculation**:
   \[ w_n = \text{TF}_n\times\log(\text{IDF}_n) \]
   where:
   - \(\text{TF}_n\) is the term frequency of the \(n\)-th word in document \(D\).
   - \(\text{IDF}_n\) is the inverse document frequency of the \(n\)-th word, which is expressed as:
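
Setting the TF-IDF weighting aside, formulas 2 to 4 (the MAP decision rule, its log form, and Laplace smoothing) can be illustrated with a minimal Python sketch. This is not the paper's implementation: it uses raw term counts rather than TF-IDF weights, and the function names (`train_multinomial_nb`, `predict`), the toy app descriptions, and the category labels are all hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs):
    """Estimate log P(c) and Laplace-smoothed log P(t|c).

    docs is a list of (tokens, category) pairs (a hypothetical input format).
    """
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)  # T_ct: count of term t in class c
    vocab = set()
    for tokens, category in docs:
        class_doc_counts[category] += 1
        term_counts[category].update(tokens)
        vocab.update(tokens)

    n_docs = sum(class_doc_counts.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}

    log_likelihood = {}
    for c in class_doc_counts:
        # Denominator of formula 4: sum over the vocabulary of (T_ct' + 1).
        denom = sum(term_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + 1) / denom) for t in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(tokens, model):
    """Return the class maximizing log P(c) + sum_k log P(t_k|c) (formula 3)."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for t in tokens:
            if t in vocab:  # terms never seen in training are skipped
                score += log_likelihood[c][t]
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy "app descriptions", not data from the paper.
TOY_APPS = [
    ("puzzle match candy levels".split(), "GAME"),
    ("racing cars levels arcade".split(), "GAME"),
    ("budget expense bank tracker".split(), "FINANCE"),
    ("stock bank invest portfolio".split(), "FINANCE"),
]

model = train_multinomial_nb(TOY_APPS)
print(predict("arcade puzzle levels".split(), model))
print(predict("bank invest".split(), model))
```

Working in log space avoids the numerical underflow that the product in formula 2 would cause for long descriptions, and the add-one term keeps a word that never appeared in a category from driving that category's probability to zero.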