A Disk-Based Algorithm for Fast Outlier Detection in Large Datasets

ZHAO Fa-Xin, BAO Yu-bin, SUN Huan-liang, YU Ge, WANG Da-ling
DOI: https://doi.org/10.4018/978-1-59904-120-9.ch002
2007-01-01
Abstract: Outlier detection is an important research issue in data mining. In the cell-based disk algorithm, the number of cells grows exponentially with dimensionality, so its performance degrades sharply as the number of cells and data points increases. Further analysis shows that many of these cells are empty and contribute nothing to outlier detection. This chapter therefore proposes a novel index structure, called the CD-Tree, which stores only non-empty cells, and adopts a clustering technique that stores the data objects belonging to the same cell in linked disk pages. Experiments were conducted to evaluate the proposed algorithms. The results show that the disk algorithm based on the CD-Tree and the clustering technique outperforms the cell-based disk algorithm, and that it can handle data of higher dimensionality than the earlier algorithm.
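The idea of keeping only non-empty cells can be illustrated with a minimal in-memory sketch. This is not the CD-Tree itself, whose layout the abstract does not describe. A plain dictionary keyed by cell coordinates stands in for the index, and the names `cell_of`, `build_sparse_grid`, and `side` are hypothetical, chosen for illustration only.

```python
from collections import defaultdict
from math import floor


def cell_of(point, side):
    """Return the integer grid coordinates of the cell containing `point`."""
    return tuple(floor(x / side) for x in point)


def build_sparse_grid(points, side):
    """Group points by cell; a cell gets an entry only if it holds a point."""
    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p, side)].append(p)
    return dict(cells)


# Three points in the unit cube, with 4 cells per dimension (side = 0.25).
points = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.9, 0.9, 0.9)]
grid = build_sparse_grid(points, side=0.25)

total_cells = 4 ** 3  # a dense grid would allocate every one of these
print(f"non-empty cells: {len(grid)} of {total_cells}")
```

The dense grid size grows as the number of cells per dimension raised to the dimensionality, while the sparse map never holds more entries than there are data points. Grouping each cell's points into one list loosely mirrors storing the objects of a cell together, although the on-disk linked pages are not modeled here.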