MapReduce based distributed improved random forest model for graduates career classification

Fei QIAO, Yanhao GE, Weichang KONG
DOI: https://doi.org/10.12011/1000-6788(2017)05-1383-10
2017-01-01
Abstract: Educational data mining (EDM) is a research area that applies data mining technology to the education sector. EDM research models educational datasets and uses effective statistical machine learning models to learn from them and to make predictions on test data. Combining machine learning models with distributed computing frameworks allows EDM to meet the needs of large-scale data processing while providing tailored data recommendations and supporting future decision-making. Against this background, this study first trains and evaluates a range of models on the data in simulation, then proposes an improved model that raises classification performance by adjusting the model and by using an improved algorithm, based on a new information gain equation, to select the optimal feature to split on. Building on the best-performing model identified in the preceding study and motivated by the demands of the "big data" era, we propose a new random forest algorithm for classifying large-scale datasets on the distributed computing framework MapReduce. Using MapReduce, we design and implement a new system that meets this requirement. In this system, a trained model can be serialized and deserialized between local disks and the distributed file system.
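The pipeline the abstract describes (choosing splits by information gain, training trees in a map step, combining them in a reduce step, and serializing the trained model) can be illustrated with a minimal single-process sketch. Everything here is an assumption made for illustration: the abstract does not give the paper's new information gain equation, so the textbook entropy-based definition is used instead; the trees are one-split stumps; the mappers are simulated with a plain loop rather than a real MapReduce job; `pickle` bytes stand in for a model file on local disk or a distributed file system; and the function names, features, and career labels are hypothetical.

```python
import math
import pickle
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature, threshold):
    """Textbook information gain of the split rows[i][feature] <= threshold.

    Assumption: this is the standard definition, not the paper's improved one.
    """
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)


def train_stump(rows, labels, rng):
    """Fit a one-split tree on a bootstrap sample and a random feature subset."""
    n = len(rows)
    picks = [rng.randrange(n) for _ in range(n)]  # sample with replacement
    xs = [rows[i] for i in picks]
    ys = [labels[i] for i in picks]
    n_features = len(rows[0])
    candidates = rng.sample(range(n_features), max(1, math.isqrt(n_features)))
    best = None  # (gain, feature, threshold)
    for feature in candidates:
        for threshold in sorted({x[feature] for x in xs}):
            gain = information_gain(xs, ys, feature, threshold)
            if best is None or gain > best[0]:
                best = (gain, feature, threshold)
    _, feature, threshold = best
    fallback = Counter(ys).most_common(1)[0][0]
    left = [y for x, y in zip(xs, ys) if x[feature] <= threshold]
    right = [y for x, y in zip(xs, ys) if x[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": Counter(left).most_common(1)[0][0] if left else fallback,
        "right": Counter(right).most_common(1)[0][0] if right else fallback,
    }


def map_train(partition, trees_per_mapper, seed):
    """Map step: one mapper trains several stumps on its own data partition."""
    rows, labels = partition
    rng = random.Random(seed)
    return [train_stump(rows, labels, rng) for _ in range(trees_per_mapper)]


def reduce_forest(mapper_outputs):
    """Reduce step: concatenate the trees emitted by all mappers into one forest."""
    return [tree for output in mapper_outputs for tree in output]


def predict(forest, row):
    """Classify one row by majority vote over all trees."""
    votes = Counter(
        tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
        for tree in forest
    )
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two numeric features, two made-up career labels.
rows = [(float(i), float(2 * i)) for i in range(1, 11)]
rows += [(float(i), float(2 * i)) for i in range(21, 31)]
labels = ["industry"] * 10 + ["academia"] * 10

# Two interleaved partitions stand in for the input splits of two mappers.
partitions = [(rows[p::2], labels[p::2]) for p in range(2)]
mapper_outputs = [map_train(part, 10, seed=p) for p, part in enumerate(partitions)]
forest = reduce_forest(mapper_outputs)

# Serialize and deserialize the trained model.
blob = pickle.dumps(forest)
restored = pickle.loads(blob)
```

Calling `predict(restored, row)` on the deserialized forest classifies a new feature vector. In a real deployment the map and reduce functions would run as tasks of a MapReduce job, and the serialized bytes would be written to and read from the distributed file system.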