Research on Data Preprocessing Methods for Big Data
Qin KONG, Chang-qing YE, Yun SUN
Abstract: In the era of big data, perceiving, representing, understanding and computing over data is an enormous challenge because of the inherent complexity of data types, organization patterns, relationships among data and data quality. Data preprocessing is an essential preparatory step before data analysis and mining. On the one hand, it ensures the correctness and effectiveness of data mining. On the other hand, adjusting the format and content of the data makes it meet the requirements of mining. We analyze the main tasks of data preprocessing and summarize several popular methods for handling various kinds of "dirty data". Algorithms for data cleaning, integration, transformation and reduction are discussed in detail. With these preprocessing methods, we can remove redundant and erroneous data, fill in incomplete data, support the required data integration, and help refine the data and keep centrally stored data consistent. We can also obtain the smallest and most reliable data set that the mining system needs. This in turn reduces the cost of data mining and improves the accuracy, validity and practicability of knowledge discovery.
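The four preprocessing tasks named in the abstract can be illustrated with a minimal Python sketch. It is not the algorithms discussed in the paper: the function names, the sample records, and the specific techniques chosen here (duplicate removal with mean imputation, an inner join on a shared key, min-max scaling, and dropping constant attributes) are all assumptions made for illustration only.

```python
from statistics import mean

def clean(rows, field):
    """Cleaning: drop exact duplicate rows, then fill missing `field` values with the mean.
    Assumes at least one row has a known value for `field`."""
    unique = []
    for row in rows:
        if row not in unique:
            unique.append(dict(row))
    known = [r[field] for r in unique if r.get(field) is not None]
    fill = mean(known)
    for r in unique:
        if r.get(field) is None:
            r[field] = fill
    return unique

def integrate(left, right, key):
    """Integration: inner-join two sources on a shared key."""
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def transform(rows, field):
    """Transformation: min-max scale `field` into [0, 1]."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in rows]

def reduce_columns(rows, key):
    """Reduction: drop attributes that take a single value across all rows."""
    constant = {f for f in rows[0] if f != key and len({r[f] for r in rows}) == 1}
    return [{f: v for f, v in r.items() if f not in constant} for r in rows]

# Hypothetical "dirty" sources: one duplicate row and one missing age.
people = [
    {"id": 1, "age": 20},
    {"id": 1, "age": 20},
    {"id": 2, "age": None},
    {"id": 3, "age": 40},
]
accounts = [
    {"id": 1, "country": "CN", "income": 100},
    {"id": 2, "country": "CN", "income": 300},
    {"id": 3, "country": "CN", "income": 200},
]

rows = clean(people, "age")
rows = integrate(rows, accounts, "id")
rows = transform(rows, "age")
rows = transform(rows, "income")
result = reduce_columns(rows, "id")
```

In this sketch the pipeline runs in the order cleaning, integration, transformation, reduction. The order is a design choice of the example rather than a requirement: cleaning first keeps duplicates from multiplying in the join, and reduction last lets the constant `country` attribute be detected on the merged data.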