Applying machine learning techniques to predict and explain subscriber churn of an online drug information platform
Georgios Theodoridis, Athanasios Tsadiras
DOI: https://doi.org/10.1007/s00521-022-07603-9
2022-07-20
Neural Computing and Applications
Abstract: Most markets are now heavily saturated, and businesses are consequently highly competitive. Retaining existing customers is therefore pivotal, which makes predicting customer loss crucial for efficiently targeting potential churners and attempting to retain them. This study provides an in-depth comparison of various machine learning techniques and advanced preprocessing methods, as well as an overall guide for handling churn prediction problems. Churn prediction is fundamentally a binary classification problem. To address it, numerous methods from different machine learning categories (linear, nonlinear, ensemble, neural networks) are constructed, optimized and trained on the subscription data of a new real-world dataset originating from a popular online drug information platform that provides information on drugs and drug substances as well as professional tools for pharmacotherapy decision making. In contrast with previous works that address traditional customer churn in the telecom, banking or insurance industries, the current study addresses online subscriber churn, where users might churn at any given moment. The study also focuses on proper preprocessing of the data via advanced machine learning methods, and on evaluating the models under different conditions to measure their robustness. The results are presented, compared, analyzed and explained. Extensive feature importance analysis is performed not only to explain the models themselves but also to indicate the main factors that contribute toward churning. The findings align with the notion that, provided the dataset is preprocessed using machine learning techniques in addition to statistical methods, all methods perform adequately and are generally viable options, but ensemble methods, namely Random Forests, are more flexible and more resistant to outliers. Feature importance analysis indicates that usage, not demographic data, is the prime indicator of churn.
computer science, artificial intelligence
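The framing in the abstract, churn prediction as binary classification followed by feature importance analysis, can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration and none of it comes from the paper: the data are synthetic, the feature names `usage` and `age` are hypothetical, and the rule that generates churn labels is invented. The classifier is a plain logistic regression (one of the linear methods the abstract mentions), not the Random Forest the study found most robust, and the importance measure is permutation importance, which may differ from the authors' analysis.

```python
import math
import random

random.seed(0)

def make_subscriber():
    """Return ([usage, age], churned) for one synthetic subscriber."""
    usage = random.random()  # hypothetical normalized activity level
    age = random.random()    # hypothetical normalized demographic feature
    # Assumption for illustration only: low usage raises churn risk, age is irrelevant.
    p_churn = 1.0 / (1.0 + math.exp(-(3.0 - 6.0 * usage)))
    return [usage, age], 1 if random.random() < p_churn else 0

data = [make_subscriber() for _ in range(600)]
train, test = data[:400], data[400:]

def predict_proba(weights, bias, x):
    """Predicted churn probability from a logistic model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, epochs=500, lr=1.0):
    """Batch gradient descent on the mean log-loss."""
    weights, bias = [0.0, 0.0], 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in rows:
            err = predict_proba(weights, bias, x) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

def accuracy(weights, bias, rows):
    """Fraction of rows classified correctly at a 0.5 threshold."""
    hits = sum((predict_proba(weights, bias, x) >= 0.5) == (y == 1) for x, y in rows)
    return hits / len(rows)

def permutation_importance(weights, bias, rows, feature, repeats=10):
    """Mean drop in accuracy after shuffling one feature column."""
    base = accuracy(weights, bias, rows)
    drops = []
    for _ in range(repeats):
        column = [x[feature] for x, _ in rows]
        random.shuffle(column)
        shuffled = [(x[:feature] + [v] + x[feature + 1:], y)
                    for (x, y), v in zip(rows, column)]
        drops.append(base - accuracy(weights, bias, shuffled))
    return sum(drops) / repeats

weights, bias = fit_logistic(train)
test_accuracy = accuracy(weights, bias, test)
usage_importance = permutation_importance(weights, bias, test, feature=0)
age_importance = permutation_importance(weights, bias, test, feature=1)
print(f"accuracy={test_accuracy:.3f} usage={usage_importance:.3f} age={age_importance:.3f}")
```

Because the synthetic labels depend only on `usage`, shuffling that column should reduce accuracy far more than shuffling `age`. This mirrors the kind of conclusion the abstract reports but is not evidence for it. In practice a library such as scikit-learn would likely replace the hand-written model and importance routine.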