Hamming Distance Based Approximate Similarity Text Search Algorithm

Haifeng Hu, Liang Zhang, Jianshen Wu
DOI: https://doi.org/10.1109/icaci.2015.7184772
2015-01-01
Abstract: We propose a Hamming distance based approximate similarity text search (HASTS) algorithm to improve the quality of queries on massive text data. The HASTS algorithm first constructs an index table from substrings extracted at random from the feature fingerprints generated by the SimHash algorithm. It then assigns weights to text terms to reduce the size of the candidate set. The final query result is obtained by comparing the Hamming distance between the query term and the text terms in the candidate set. Extensive simulations are conducted to analyze the influence of different parameters on the query performance of the HASTS algorithm and to compare its performance with the existing search algorithm. The results show that the HASTS algorithm can satisfy the query requirements of massive text data with relatively low overhead.
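
The pipeline outlined in the abstract (SimHash fingerprints, an index keyed on randomly sampled fingerprint bits, then a Hamming distance check over the candidates) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name SampledBitIndex, the parameters num_tables, sample_size, and max_distance, the whitespace tokenizer, and the use of MD5 as the token hash are all assumptions made for illustration. The term-weighting step is omitted because the abstract does not specify how the weights are computed.

```python
import hashlib
import random
from collections import defaultdict

BITS = 64  # fingerprint width (assumed; the abstract does not state it)


def simhash(text, bits=BITS):
    """Unweighted SimHash: every token votes +1/-1 on each bit position."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit token hash
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class SampledBitIndex:
    """Hypothetical index: each table is keyed on a random subset of bits."""

    def __init__(self, num_tables=8, sample_size=12, bits=BITS, seed=0):
        rng = random.Random(seed)
        self.positions = [
            sorted(rng.sample(range(bits), sample_size))
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.items = []  # (text, fingerprint) pairs

    @staticmethod
    def _key(fp, positions):
        # The "substring" of the fingerprint selected by the sampled positions.
        return tuple((fp >> p) & 1 for p in positions)

    def add(self, text):
        fp = simhash(text)
        idx = len(self.items)
        self.items.append((text, fp))
        for positions, table in zip(self.positions, self.tables):
            table[self._key(fp, positions)].append(idx)

    def query(self, text, max_distance=3):
        fp = simhash(text)
        # Candidate set: items sharing a sampled substring in at least one table.
        candidates = set()
        for positions, table in zip(self.positions, self.tables):
            candidates.update(table.get(self._key(fp, positions), []))
        # Final filter: exact Hamming distance on the full fingerprints.
        hits = []
        for i in candidates:
            stored_text, stored_fp = self.items[i]
            d = hamming(fp, stored_fp)
            if d <= max_distance:
                hits.append((d, stored_text))
        return sorted(hits)


index = SampledBitIndex()
index.add("hamming distance based approximate text search")
index.add("a completely unrelated sentence about cooking pasta")
results = index.query("hamming distance based approximate text search")
```

In this sketch, an identical query text always lands in the same buckets as its stored copy and is returned with distance 0. Near-duplicates are found only if at least one table's sampled bits agree, so num_tables and sample_size trade recall against candidate-set size.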