Efficient text representation via weighted co-contributions of terms on class tendency

Yuan Ping, Yajian Zhou, Huina Li, Yixian Yang
2011-01-01
Abstract: Text representation dominates the performance of text categorization. This paper proposes an effective text representation scheme that uses passage-based document structure information and a novel term weighting factor to integrate the contributions of terms to each category. First, with a self-adaptive adjustment strategy, a document is split into passages, within which all terms are treated as equally important, in order to extract the distributional coefficient. Then, a novel term weighting factor is designed to preserve the probabilities of class tendencies. Finally, to avoid a high-dimensional document vector, the scheme uses a weighted combination of terms based on passages to vectorize a document in a much lower dimension while minimizing the loss of information. Experiments with a support vector machine (SVM) classifier on four corpora show that the proposed scheme can achieve accuracy comparable to state-of-the-art methods while significantly improving efficiency. Further analysis also indicates that exploring structure information is another promising direction for further improvement and special uses. © 2011 ICIC International.
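The three steps in the abstract can be illustrated with a rough sketch. This is not the authors' formulation: the abstract gives no equations, so the fixed-size passage split (standing in for the self-adaptive strategy), the count-based estimate of P(class | term) (standing in for the novel weighting factor), and all function names below are assumptions made only for illustration. What the sketch does show is the central idea of a document vector with one dimension per class rather than one per term.

```python
from collections import Counter, defaultdict


def split_passages(tokens, size):
    """Fixed-size passages (assumption: the paper's split is self-adaptive)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def class_tendency(train_docs):
    """Estimate P(class | term) from raw term counts in labeled documents.

    Assumption: a simple stand-in for the paper's term weighting factor.
    """
    counts = defaultdict(Counter)  # term -> Counter mapping class -> count
    for tokens, label in train_docs:
        for term in tokens:
            counts[term][label] += 1
    return {
        term: {c: n / sum(per_class.values()) for c, n in per_class.items()}
        for term, per_class in counts.items()
    }


def represent(tokens, tendency, classes, passage_size=4):
    """Map a document to a vector with one dimension per class."""
    vec = [0.0] * len(classes)
    for passage in split_passages(tokens, passage_size):
        # Terms inside a passage are weighted equally (1 / passage length).
        for term in passage:
            weights = tendency.get(term, {})
            for i, c in enumerate(classes):
                vec[i] += weights.get(c, 0.0) / len(passage)
    total = sum(vec)
    # Normalize so the entries sum to 1; leave an all-zero vector unchanged.
    return [v / total for v in vec] if total else vec


# Toy usage with a hypothetical two-class training set.
train = [
    (["goal", "match", "team"], "sport"),
    (["stock", "market", "team"], "finance"),
]
classes = ["sport", "finance"]
tendency = class_tendency(train)
doc_vector = represent(["goal", "team", "stock"], tendency, classes)
```

The resulting vector has as many entries as there are classes, regardless of vocabulary size, which is where the efficiency gain described in the abstract would come from when the vectors are passed to an SVM.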