Research and implementation of user clustering based on MapReduce in multimedia big data

Tongke Fan
DOI: https://doi.org/10.1007/s11042-017-4825-4
IF: 2.577
2017-06-01
Multimedia Tools and Applications
Abstract: Poor understanding of massive data and low clustering efficiency are problems in the context of big data. To address them, a Canopy + K-means clustering algorithm is proposed, and the MapReduce programming model is used to make full use of the computing and storage capacity of a Hadoop cluster. A large population of buyers on Taobao is taken as the application context for a case study using Mahout, the Hadoop platform's data mining library. A general procedure for mining with Mahout is also given. The MapReduce-based clustering algorithm shows good clustering quality and operation speed. The Canopy + K-means algorithm is compared with the K-means algorithm in terms of runtime, speed-up ratio, and scalability. Both clustering algorithms are tested on clusters with different numbers of nodes and on datasets of various scales. The experimental results show that the Canopy + K-means algorithm runs faster than the K-means algorithm; both show a good speed-up ratio in the Hadoop environment, and that of Canopy + K-means is considerably better than that of K-means.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
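The Canopy + K-means combination named in the abstract can be illustrated with a minimal single-machine sketch. This is not the paper's Mahout/MapReduce implementation: the function names, the thresholds `t1` and `t2`, and the sample points are all assumptions chosen for illustration. The idea shown is that a cheap canopy pass produces the initial centers (and thus the number of clusters) that K-means then refines. In a MapReduce setting, the assignment step would roughly correspond to the map phase and the centroid update to the reduce phase.

```python
import math  # math.dist requires Python 3.8+


def canopy_centers(points, t1, t2):
    """Return canopy centroids; t1 (loose) must exceed t2 (tight)."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining.pop(0)
        # Every point within t1 of the seed joins this canopy.
        members = [seed] + [p for p in remaining if math.dist(seed, p) < t1]
        centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        # Points within t2 of the seed may not start a new canopy.
        remaining = [p for p in remaining if math.dist(seed, p) >= t2]
    return centers


def kmeans(points, centers, max_iter=20):
    """Plain Lloyd iterations starting from the given centers."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step (conceptually the "map" phase).
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step (conceptually the "reduce" phase).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical 2-D feature vectors standing in for buyer records.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
seeds = canopy_centers(points, t1=3.0, t2=1.5)
centers, clusters = kmeans(points, seeds)
```

Because the canopy pass fixes both the number of clusters and their starting positions, K-means here needs no random initialization, which is one plausible reason for the shorter runtime the abstract reports.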