Short text similarity measurement using context‐aware weighted biterms

Shuiqiao Yang, Guangyan Huang, Bahadorreza Ofoghi, John Yearwood
DOI: https://doi.org/10.1002/cpe.5765
2020-04-13
Concurrency and Computation: Practice and Experience
Abstract: With the development of internet technologies, social media, and mobile devices, short texts have become an increasingly popular medium for users to communicate with friends, search for information, and review products. Measuring the similarity between short texts is a fundamental task because of its importance in many applications, such as text retrieval, topic discovery, and event detection. However, short texts generally comprise sparse, noisy, and ambiguous information, so effectively measuring the distance between them is challenging. In this paper, we use corpus‐wide word co‐occurrence information to enrich document‐level features, mitigating the challenges that the sparseness of short texts poses for distance measurement. We propose a novel context‐aware weighted Biterm method for short text Distance Measurement (BDM). In BDM, we extract biterms (i.e., word pairs) from a short text corpus and use a biterm topic model to determine the global weights of biterms in the corpus. We then determine the local importance of a biterm in different contexts (i.e., short texts) based on its corpus‐level weight. The distance between two short texts is computed using the context‐aware weighted biterms. Experimental results on three real‐world datasets demonstrate the improved accuracy and effectiveness of the proposed BDM.
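The pipeline outlined in the abstract (extract biterms, weight them globally, adapt the weights to each short text, then compare texts) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not give the weighting or distance formulas, so the sketch rests on stated assumptions: the global weight is approximated by document frequency rather than a biterm topic model, the local weight is a simple per-text normalisation, and the distance is a weighted Jaccard-style measure. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(text):
    """Return the unordered word pairs (biterms) of one short text."""
    words = sorted(set(text.lower().split()))
    return set(combinations(words, 2))


def global_weights(corpus):
    """Stand-in global weight: the fraction of texts containing each biterm.

    Assumption: the paper derives these weights from a biterm topic model;
    document frequency is used here only to keep the sketch self-contained.
    """
    counts = Counter(b for text in corpus for b in extract_biterms(text))
    return {b: c / len(corpus) for b, c in counts.items()}


def local_weights(text, weights):
    """Normalise global weights over one text's biterms (hypothetical scheme)."""
    biterms = extract_biterms(text)
    total = sum(weights.get(b, 0.0) for b in biterms)
    if total == 0:
        return {}
    return {b: weights.get(b, 0.0) / total for b in biterms}


def distance(text_a, text_b, weights):
    """Weighted Jaccard-style distance in [0, 1] (an illustrative choice)."""
    wa = local_weights(text_a, weights)
    wb = local_weights(text_b, weights)
    keys = set(wa) | set(wb)
    if not keys:
        return 1.0  # no usable biterms in either text: treat as maximally distant
    overlap = sum(min(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    union = sum(max(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    return 1.0 - overlap / union
```

Under these assumptions, two texts that share highly weighted biterms end up closer than texts that share only rare ones, and texts with no biterm in common are at distance 1. Swapping `global_weights` for topic-model-derived weights would bring the sketch closer to the approach the abstract describes.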