Introduction to Special Issue on Large-Scale Data Mining

Jimeng Sun, Yan Liu, Jie Tang, C. Apté
DOI: https://doi.org/10.1145/1921632.1921633
2011-02-01
Abstract: Large-scale data mining focuses on the main premise of all practical data mining methods, that is, the scalability and efficiency of the method. With the exponential growth of data volume, we argue that scalability often becomes the top requirement for making any data mining method relevant in practice. The goal of this special issue is to promote the theory and applications of large-scale data mining. We attracted 19 submissions, which were reviewed by three anonymous referees through two rounds of reviews, and selected six articles for inclusion in this special issue. The selected articles cover systems, algorithms, and a comparison survey. From the systems aspect, we selected two articles. “HADI: Mining Radii of Large Graphs” by U. Kang et al. proposes an extremely scalable method for computing the diameter of large graphs using MapReduce. The authors evaluated their system on real-world graphs, including one with 6 billion edges that occupies 1/8 of a terabyte of storage. “Robust Record Linkage Blocking Using Suffix Arrays and Bloom Filters” by Timothy De Vries et al. studies the large-scale record linkage problem, which is of great practical significance in data mining and data management applications. In particular, the authors combine suffix arrays and Bloom filters to achieve both scalability and accuracy. From the algorithmic aspect, we selected four articles. “Temporal Link Prediction using Matrix and Tensor Factorizations” by Daniel Dunlavy et al. studies temporal link prediction on graphs using scalable, sparse matrix and tensor factorizations. “Enhancing Clustering Quality through Landmark-based Dimensionality Reduction” by Panagis Magdalinos et al. proposes a landmark-based dimensionality reduction method that can improve both the scalability of subsequent clustering methods and the clustering quality.
“Clustering Large Attributed Graphs: A Balance Between Structural and Attribute Similarities” by Hong Cheng et al. studies a special type of graph in which each node is associated with certain attributes. The authors propose an efficient clustering algorithm for these graphs. Finally, “Fast Algorithms for Approximating the Singular Value Decomposition” by Aditya Krishna Menon and Charles Elkan surveys a number of fast low-rank approximation algorithms that have been used in many data mining methods and applications. The authors quantitatively compare the performance of those algorithms in various aspects. The results provide practical guidance for selecting the right algorithm for fast low-rank approximation.
Computer Science