A New Text Feature Extraction Model and Its Application in Document Copy Detection

JP Bao, JY Shen, XD Liu, QB Song
DOI: https://doi.org/10.1109/icmlc.2003.1264447
2003-01-01
Abstract: Text feature extraction is a common task in information retrieval, text mining, Web mining, text classification and clustering, and document copy detection. The most popular approach is the word-frequency scheme, which represents a document as a word frequency vector. The cosine function, dot product, and proportion function are the usual similarity measures on these vectors. However, such a vector captures only the global semantic features of a document and loses local features and structural information, which makes it hard to distinguish texts well, especially in copy detection. In this paper we present a new text feature extraction model, the semantic sequence model (SSM), which is based on the concepts of word distance, word density, and semantic sequence. The semantic sequences of a document contain not only local semantic features but also global features and structural information, and with them we obtain excellent accuracy in text copy detection. At the end of the paper we compare SSM with VSM and RFM; the experimental results show that SSM is the superior model.
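The limitation of the word-frequency scheme described in the abstract can be illustrated with a minimal sketch. This is not the paper's SSM, and it is not taken from the paper: the function names, the whitespace tokenization, and the sample sentences are all assumptions chosen for illustration. It builds word frequency vectors and compares them with the dot product and cosine measures the abstract mentions.

```python
import math
from collections import Counter


def word_frequency_vector(text):
    # Assumption: naive lowercase + whitespace tokenization, for illustration only.
    return Counter(text.lower().split())


def dot_product(a, b):
    # Sum of count products over the words of `a`; a Counter returns 0 for missing words.
    return sum(count * b[word] for word, count in a.items())


def cosine_similarity(a, b):
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot_product(a, b) / (norm_a * norm_b)


# Hypothetical sample texts: same words, different order and meaning.
doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

vec_a = word_frequency_vector(doc_a)
vec_b = word_frequency_vector(doc_b)

# The two vectors are identical, so the cosine is 1.0 up to floating-point rounding,
# even though the sentences say different things.
print(vec_a == vec_b)
print(cosine_similarity(vec_a, vec_b))
```

Because the vector keeps only counts, word order and position are discarded. This is the kind of local and structural information the abstract says the word-frequency scheme loses.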