Context-Preserving Hashing for Fast Text Classification.

Lianhua Chi, Bin Li, Xingquan Zhu
DOI: https://doi.org/10.1137/1.9781611973440.12
2014-01-01
Abstract: A number of approximate algorithms exist for text similarity computation, such as min-wise hashing, random projection, and feature hashing, all of which are based on the bag-of-words representation. A limitation of this “flat-set” representation is that context information and semantic hierarchy cannot be preserved. In this paper, we aim to compute similarities between texts quickly while also preserving context information. To take semantic hierarchy into account, we consider a notion of “multi-level exchangeability”, which can be applied at the word level, sentence level, paragraph level, etc. We employ a nested set to represent a multi-level exchangeable object. To fingerprint nested sets for fast comparison, we propose a Recursive Min-wise Hashing (RMH) algorithm that has the same computational cost as the standard min-wise hashing algorithm. Theoretical study and bound analysis confirm that RMH is a highly concentrated estimator. The empirical studies show that the proposed context-preserving hashing method can significantly outperform min-wise hashing and feature hashing in accuracy at the same (or less)
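The abstract does not spell out the RMH construction, so the following Python sketch is only one plausible reading of "recursively min-hashing a nested set", not the authors' algorithm. It assumes that each inner set (for example, a sentence) is first reduced to a min-hash fingerprint, and that the j-th component of that fingerprint then serves as the inner set's token when computing the j-th hash of the enclosing set. The names `hash_token`, `recursive_minhash`, and `estimate_similarity`, the use of BLAKE2b as the hash family, and the default of 64 hash functions are all illustrative assumptions.

```python
import hashlib


def hash_token(seed: int, token: str) -> int:
    """Hypothetical seeded hash: one 'hash function' per seed value."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def recursive_minhash(node, num_hashes: int = 64) -> tuple:
    """Fingerprint a nested set.

    `node` is a non-empty collection whose items are either strings
    (leaf tokens such as words) or further nested collections.
    """
    children = list(node)
    if not children:
        raise ValueError("empty sets are not supported in this sketch")

    # Fingerprint every nested child first; leaves have no fingerprint.
    child_sigs = [
        None if isinstance(child, str) else recursive_minhash(child, num_hashes)
        for child in children
    ]

    signature = []
    for j in range(num_hashes):
        best = None
        for child, sig in zip(children, child_sigs):
            # A leaf is its own token; a nested child is represented by the
            # j-th component of its fingerprint (assumed encoding: "#<int>").
            token = child if sig is None else f"#{sig[j]}"
            value = hash_token(j, token)
            if best is None or value < best:
                best = value
        signature.append(best)
    return tuple(signature)


def estimate_similarity(sig_a: tuple, sig_b: tuple) -> float:
    """Fraction of fingerprint positions on which two signatures agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Documents as sets of sentences, each sentence a set of words.
doc_a = [["the", "cat", "sat"], ["dogs", "bark"]]
doc_b = [["bark", "dogs"], ["sat", "the", "cat"]]   # same nesting, reordered
doc_c = [["the", "cat", "bark"], ["dogs", "sat"]]   # same words, regrouped

sig_a, sig_b, sig_c = (recursive_minhash(d) for d in (doc_a, doc_b, doc_c))
print(estimate_similarity(sig_a, sig_b))  # 1.0: order within a level is ignored
print(estimate_similarity(sig_a, sig_c))
```

Under a flat bag-of-words min-hash, `doc_a` and `doc_c` would receive identical fingerprints because they contain the same words. In this sketch the grouping of words into sentences changes the inner fingerprints, so the two documents are no longer treated as the same object, which is the kind of context sensitivity the abstract describes.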