An Ensemble Random Forest Algorithm for Insurance Big Data Analysis

Ziming Wu, Weiwei Lin, Zilong Zhang, Angzhan Wen, Longxin Lin
DOI: https://doi.org/10.1109/cse-euc.2017.99
2017-07-01
Abstract: Because of the imbalanced distribution of business data, missing user features, and many other factors, applying big data techniques directly to real-world business data tends to produce results that deviate from the business goals. Insurance business data is difficult to model with classification algorithms such as logistic regression and SVM. This paper combines a heuristic bootstrap sampling approach with ensemble learning for large-scale insurance business data mining, and proposes an ensemble random forest algorithm that uses the parallel computing capability and memory-cache mechanism provided by Spark. We collected insurance business data from China Life Insurance Company and used the proposed algorithm to identify potential customers. Experimental results show that the ensemble random forest algorithm outperforms SVM and other classification algorithms in both performance and accuracy on imbalanced data.
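The general idea of pairing bootstrap sampling with an ensemble on imbalanced data can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the heuristic sampling rule, so a plain class-balanced bootstrap is swapped in, one-split decision stumps stand in for full random forest trees, and everything runs in plain Python rather than on Spark. All function names (balanced_bootstrap, fit_stump, fit_ensemble, predict) are hypothetical.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Sample row indices with replacement so every class contributes
    as many rows as the smallest class (an assumed balancing rule)."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n = min(len(idx) for idx in by_class.values())
    sample = []
    for idx in by_class.values():
        sample.extend(rng.choices(idx, k=n))
    return sample


def fit_stump(X, y, rows, rng):
    """Fit a one-split stump on one randomly chosen feature.
    Returns (feature, threshold, left_label, right_label)."""
    f = rng.randrange(len(X[0]))
    best = None
    for t in sorted({X[i][f] for i in rows}):
        left = [y[i] for i in rows if X[i][f] <= t]
        right = [y[i] for i in rows if X[i][f] > t]
        if not left or not right:
            continue
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
        if best is None or errors < best[0]:
            best = (errors, f, t, l_lab, r_lab)
    if best is None:
        # No usable split: always predict the majority label of the sample.
        maj = Counter(y[i] for i in rows).most_common(1)[0][0]
        return (f, float("inf"), maj, maj)
    return best[1:]


def fit_ensemble(X, y, n_estimators=25, seed=0):
    """Train each stump on its own class-balanced bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(X, y, balanced_bootstrap(y, rng), rng)
            for _ in range(n_estimators)]


def predict(ensemble, x):
    """Majority vote over all stumps."""
    votes = Counter(l if x[f] <= t else r for f, t, l, r in ensemble)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy imbalanced data: 95 negatives, 5 positives, one feature.
    X = [[i / 100] for i in range(95)] + [[2 + i / 10] for i in range(5)]
    y = [0] * 95 + [1] * 5
    model = fit_ensemble(X, y)
    print(predict(model, [2.5]), predict(model, [0.0]))
```

Because each learner sees equally many rows from every class, the rare class is not drowned out by the majority class during training, which is the motivation the abstract gives for sampling before ensembling.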