Large-scale document similarity computation based on cloud computing platform

Chaobo He, Yong Tang, Feiyi Tang, Atiao Yang
DOI: https://doi.org/10.1109/ICPCA.2011.6106499
2011-01-01
Abstract: Current approaches to large-scale document similarity computation suffer from low efficiency. To improve on them, this paper proposes a new approach based on a cloud computing platform. The approach computes document similarity using the traditional vector space model and applies the MapReduce computation model to parallelize both the distributed inverted index and the similarity computation. We first discuss the disadvantages of traditional approaches, then present the structure of the distributed inverted index, the architecture of the cloud computing platform, and the core algorithms based on the MapReduce computation model. Finally, we report related experiments. With this approach, large-scale document similarity computation runs more effectively and scales better.
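The abstract does not give the algorithms themselves, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes raw term frequencies as vector weights, cosine similarity as the measure, and an in-memory `shuffle` helper standing in for a real MapReduce framework; all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def map_index(doc_id, text):
    """Map step: emit (term, (doc_id, term_frequency)) for one document."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def map_pairs(postings):
    """Map step: for one term's posting list, emit partial dot products."""
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        yield (d1, d2), w1 * w2


def cosine_similarities(docs):
    """Return {(doc_a, doc_b): cosine} for pairs sharing at least one term."""
    # Job 1: build the inverted index (term -> list of (doc_id, tf)).
    index = shuffle(
        pair for doc_id, text in docs.items() for pair in map_index(doc_id, text)
    )
    # Squared vector norms, accumulated from the same postings.
    squared_norm = defaultdict(int)
    for postings in index.values():
        for doc_id, tf in postings:
            squared_norm[doc_id] += tf * tf
    # Job 2: sum the partial products per document pair, then normalise.
    dots = shuffle(
        pair for postings in index.values() for pair in map_pairs(postings)
    )
    return {
        (a, b): sum(parts) / math.sqrt(squared_norm[a] * squared_norm[b])
        for (a, b), parts in dots.items()
    }
```

Calling `cosine_similarities({"a": "cloud computing platform", "b": "cloud computing model"})` returns a single entry keyed by `("a", "b")`. Because pairs are generated only from shared posting lists, documents with no terms in common never produce any work, which is the usual motivation for routing similarity computation through an inverted index.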