Data Cleaning In Data Warehouse: A Survey of Data Pre-processing Techniques and Tools

Muhammad Gufran Khan,Nosheen Nazir,Anosh Fatima
DOI: https://doi.org/10.5815/IJITCS.2017.03.06
2017-03-08
Abstract:A Data Warehouse is a computer system designed for storing and analyzing an organization's historical data from day-to-day operations in Online Transaction Processing System (OLTP). Usually, an organization summarizes and copies information from its operational systems to the data warehouse on a regular schedule and management performs complex queries and analysis on the information without slowing down the operational systems. Data need to be pre-processed to improve quality of data, before storing into data warehouse. This survey paper presents data cleaning problems and the approaches in use currently for preprocessing. To determine which technique of preprocessing is best in what scenario to improve the performance of Data Warehouse is main goal of this paper. Many techniques have been analyzed for data cleansing, using certain evaluation attributes and tested on different kind of data sets. Data quality tools such as YALE, ALTERYX, and WEKA have been used for conclusive results to ready the data in data warehouse and ensure that only cleaned data populates the warehouse, thus enhancing usability of the warehouse. Results of paper can be useful in many future activities like cleansing, standardizing, correction, matching and transformation. This research can help in data auditing and pattern detection in the data.
Computer Science
What problem does this paper attempt to address?