Design And Implement Of Distributed Document Clustering Based On Mapreduce

Jian Wan, Wenming Yu, Xianghua Xu
2009-01-01
Abstract: In this paper, we describe how document clustering for large collections can be implemented efficiently with MapReduce. The Hadoop implementation of MapReduce provides a convenient and flexible framework for distributed computing on a cluster of commodity machines. We present the design and implementation of the TF-IDF and K-Means algorithms on MapReduce. More importantly, we improve the efficiency and effectiveness of the algorithm. Finally, we give the results and some related discussion.
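The abstract does not spell out how the two algorithms are expressed as map and reduce steps, so the following is only an illustrative sketch, not the authors' implementation. It simulates the map and reduce phases in plain Python rather than calling the Hadoop API. It assumes the common weighting tf × log(N/df) and squared Euclidean distance, and all function names (`tfidf`, `kmeans_map`, `kmeans_reduce`, `kmeans_iteration`) are hypothetical.

```python
import math
from collections import defaultdict


def tfidf(docs):
    """docs: dict of doc_id -> list of tokens. Returns doc_id -> {term: weight}.

    Assumed weighting: (term count / document length) * log(N / document frequency).
    """
    # Count, for each term, the number of documents that contain it.
    df = defaultdict(int)
    for tokens in docs.values():
        for term in set(tokens):
            df[term] += 1

    n = len(docs)
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = defaultdict(int)
        for term in tokens:
            tf[term] += 1
        vectors[doc_id] = {
            t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()
        }
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def kmeans_map(vector, centroids):
    """Map step: emit (index of nearest centroid, (vector, 1))."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: squared_distance(vector, centroids[i]),
    )
    return nearest, (vector, 1)


def kmeans_reduce(values):
    """Reduce step: average all vectors assigned to one centroid."""
    total = defaultdict(float)
    count = 0
    for vector, n in values:
        for term, weight in vector.items():
            total[term] += weight
        count += n
    return {term: weight / count for term, weight in total.items()}


def kmeans_iteration(vectors, centroids):
    """One K-Means iteration: map every document, group by key, reduce per key."""
    groups = defaultdict(list)
    for vector in vectors.values():
        key, value = kmeans_map(vector, centroids)
        groups[key].append(value)
    # A centroid that received no documents is kept unchanged.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]
```

In a real MapReduce job the grouping by key would be done by the framework's shuffle phase, and `kmeans_iteration` would be repeated as a sequence of jobs until the centroids stop changing. Emitting a count alongside each vector is one way to allow partial sums to be combined before the reduce step.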