Predicting Customer Churn: Extreme Gradient Boosting with Temporal Data

Bryan Gregory
DOI: https://doi.org/10.48550/arXiv.1802.03396
2018-02-10
Abstract: Accurately predicting customer churn using large-scale time-series data is a common problem facing many business domains. The creation of model features across various time windows for training and testing can be particularly challenging due to temporal issues common to time-series data. In this paper, we will explore the application of extreme gradient boosting (XGBoost) on a customer dataset with a wide variety of temporal features in order to create a highly accurate customer churn model. In particular, we describe an effective method for handling temporally sensitive feature engineering. The proposed model was submitted in the WSDM Cup 2018 Churn Challenge and achieved first place out of 575 teams.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **how to accurately predict customer churn using large-scale time-series data**. Specifically, the author explores how to apply the Extreme Gradient Boosting (XGBoost) model effectively to build a high-precision customer churn prediction model when dealing with time-sensitive features.

### Problem Background:

1. **The Importance of Customer Churn Prediction**:
   - Accurately predicting customer churn is crucial for the long-term success of many enterprises. It affects multiple aspects of the business, such as proactive customer marketing, sales forecasting, and churn-based pricing models.
   - Even a slight improvement in prediction accuracy may significantly increase the enterprise's profit.
2. **Existing Challenges**:
   - Dealing with time-sensitive features in time-series data is a major challenge. When training, cross-validating, and testing machine-learning models, it is essential that every feature correctly accounts for time offsets.
   - The selection of time windows and feature engineering methods has an important impact on model performance.

### Solutions:

- **Using the XGBoost Model**:
  - XGBoost is a modern machine-learning library that is suitable for processing high-dimensional data and creating very accurate models.
- **Time-Sensitive Feature Engineering**:
  - The paper proposes two main methods for handling time-sensitive features:
    1. **Relative Refactoring Method**:
       - Map date-driven features to a new feature space relative to a selected time point, so that the features are comparable across different time periods.
       - The formula is: \[ \text{New Feature} = \text{Data Element} - \text{Static Time Point} \]
       - For example, calculate "the number of days since registration" or "the number of days since the last login".
    2. **Absolute Method**:
       - Convert the original date field into an integer form and use it directly as the model input.
       - It is suitable for cases where the absolute date/time carries more signal, such as a user registering after a specific holiday.
- **Experimental Setup**:
  - The data set is from the KKBOX music streaming service, covering user activity logs, transaction information, and member data.
  - The data is divided into three time periods: the training set (January 2017), the cross-validation set (February 2017), and the test set (March 2017).
  - The model performance is evaluated by Log Loss.

### Results:

- The final model won first place in the WSDM Cup 2018 Churn Challenge. A total of 575 teams participated, and the final Log Loss score was 0.07974.

### Summary:

The paper shows that carefully designed time-sensitive feature engineering, combined with the XGBoost model, can significantly improve the accuracy of customer churn prediction. This demonstrates the importance of feature engineering and provides a valuable reference for future research.
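
The two date-handling methods described under Solutions, along with the Log Loss metric, can be illustrated with a minimal Python sketch. This is not the author's code. The record fields (`registered`, `last_login`), the function names, the choice of the last day of each month as the reference point, and the `YYYYMMDD` integer encoding are all assumptions made for illustration. The relative feature is computed as reference date minus event date so that "days since" values are positive, which is the opposite sign of the formula as written above.

```python
from datetime import date
import math


def relative_days(event_date: date, reference_date: date) -> int:
    """Relative method: days elapsed from event_date to reference_date.

    Computed as reference minus event so that "days since" is positive
    for past events (a sign convention assumed for this sketch).
    """
    return (reference_date - event_date).days


def absolute_int(event_date: date) -> int:
    """Absolute method: encode the date as a YYYYMMDD integer (assumed encoding)."""
    return event_date.year * 10000 + event_date.month * 100 + event_date.day


def build_features(user: dict, reference_date: date) -> dict:
    """Build features for one user against one time window's reference date."""
    return {
        "days_since_registration": relative_days(user["registered"], reference_date),
        "days_since_last_login": relative_days(user["last_login"], reference_date),
        "registration_date_int": absolute_int(user["registered"]),
    }


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss, clipping probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical user record; field names are illustrative only.
user = {"registered": date(2016, 12, 25), "last_login": date(2017, 1, 20)}

# Same user, two windows: the relative features shift with the reference
# date, while the absolute feature stays fixed.
train_features = build_features(user, reference_date=date(2017, 1, 31))
valid_features = build_features(user, reference_date=date(2017, 2, 28))

print(train_features)
print(valid_features)
print(log_loss([1, 0], [0.9, 0.1]))
```

Recomputing the relative features against each window's own reference date is what keeps a value such as "days since last login" comparable between the training, cross-validation, and test periods. The absolute integer is left unchanged so that calendar effects, such as a registration just after a holiday, remain visible to the model.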