Fast Document Cosine Similarity Self-Join on GPUs.
Yilin Feng, Jie Tang, Meilin Liu, Chongjun Wang, Junyuan Xie
DOI: https://doi.org/10.1109/ictai.2018.00040
2018-01-01
Abstract: Similarity search has been studied in many fields of computer science, including data mining, information retrieval, and databases. Document similarity self-join is a crucial part of many applications, such as near-duplicate document detection, document clustering, and web search. Given a collection of documents, document similarity self-join finds all pairs of documents whose similarity is no lower than a threshold value. However, similarity search is computationally intensive, and its running time grows quickly as the dataset size increases. Many serial algorithms therefore speed up the process by pruning the candidate set for each query object on high-dimensional sparse datasets, including documents. The efficiency of those serial algorithms, however, degrades badly as the threshold decreases. Parallel implementations based on OpenMP or MapReduce adopt the same pruning policy and do not fully solve the problem. In this context, taking into account the features of document datasets, we propose 2Step-SSJ, which solves the document similarity self-join in the CUDA environment on GPUs. 2Step-SSJ performs the similarity self-join in two steps, namely similarity computing on the inverted list and similarity computing on the forward list, which balances memory access against dot-product computation. The experimental results show that 2Step-SSJ solves the problem much faster than existing methods on three benchmark text corpora, generally achieving a speedup of 2x-23x over the state-of-the-art parallel algorithm while keeping a relatively stable running time across different threshold values.
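
To make the problem statement concrete, the following is a minimal serial sketch of a cosine similarity self-join that accumulates dot products over an inverted list. It is not the 2Step-SSJ algorithm and does not run on a GPU; it only illustrates the task the abstract defines. The function names (`normalize`, `cosine_self_join`), the term-to-weight dictionary representation, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def normalize(doc):
    """Scale a sparse term -> weight dict to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in doc.values()))
    return {t: w / norm for t, w in doc.items()} if norm > 0 else {}


def cosine_self_join(docs, threshold):
    """Return sorted (i, j, sim) triples with i < j and sim >= threshold."""
    vectors = [normalize(d) for d in docs]
    inverted = defaultdict(list)  # term -> [(doc_id, weight)] of earlier docs
    results = []
    for j, vec in enumerate(vectors):
        # Accumulate partial dot products against every earlier document
        # that shares at least one term with document j.
        scores = defaultdict(float)
        for term, w in vec.items():
            for i, wi in inverted[term]:
                scores[i] += w * wi
        # For unit vectors the dot product equals the cosine similarity.
        for i, s in scores.items():
            if s >= threshold:
                results.append((i, j, s))
        # Index document j only after querying, so each pair is seen once.
        for term, w in vec.items():
            inverted[term].append((j, w))
    return sorted(results)


# Hypothetical toy corpus: documents 0 and 1 share two terms, document 2 none.
docs = [
    {"gpu": 1.0, "join": 1.0},
    {"gpu": 1.0, "join": 1.0, "cuda": 1.0},
    {"index": 1.0},
]
pairs = cosine_self_join(docs, 0.5)
```

This sketch applies no pruning, so its cost depends on the lengths of the inverted lists rather than on the threshold. The threshold only filters the accumulated scores at the end.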