Theme word subspace method for text document categorization

Zhou Xiaofei, Guo Li, Tan Jianlong, Jiang Wenhan
DOI: https://doi.org/10.1145/2462130.2462136
2012-01-01
Abstract: In this paper, a text document categorization method called Theme Word Subspace (TWS) learning is presented, which uses theme words to jointly express class-semantic information for document classification. For each class corpus, the theme words with high probability in the topic structure are extracted first; these words then serve as theme elements that span a class subspace, jointly representing the semantics and distribution of the class. To categorize a text document, it is assigned to the nearest subspace, that is, the one whose theme words best represent the test document. In TWS, the L1 and L2 norms are used separately to measure the distance from a test document to each subspace. In experiments on a large Chinese text corpus, the proposed TWS learning methods exhibit comparable performance for text document categorization.
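The nearest-subspace rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each class is given as a matrix whose columns are theme word vectors, fits the test document by ordinary least squares, and measures the residual with the L1 or L2 norm. The function names (`subspace_distance`, `classify`), the `norm_ord` parameter, and the toy data are hypothetical. In particular, measuring a least-squares residual in the L1 norm is a simplification and may differ from the L1 formulation used in the paper.

```python
import numpy as np


def subspace_distance(basis, doc, norm_ord=2):
    """Distance from `doc` to the span of the columns of `basis`.

    Assumption: coefficients come from a least-squares fit; the residual
    is then measured with the requested vector norm (1 or 2).
    """
    coeffs, *_ = np.linalg.lstsq(basis, doc, rcond=None)
    residual = doc - basis @ coeffs
    return float(np.linalg.norm(residual, ord=norm_ord))


def classify(class_bases, doc, norm_ord=2):
    """Assign `doc` to the class whose subspace is nearest."""
    distances = {
        label: subspace_distance(basis, doc, norm_ord)
        for label, basis in class_bases.items()
    }
    best_label = min(distances, key=distances.get)
    return best_label, distances


# Toy data (hypothetical): a 4-term vocabulary and two classes,
# each spanned by two theme word vectors stored as columns.
class_bases = {
    "sports": np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]),
    "finance": np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
}
doc = np.array([0.9, 0.4, 0.1, 0.0])

label, distances = classify(class_bases, doc, norm_ord=2)
label_l1, distances_l1 = classify(class_bases, doc, norm_ord=1)
print(label, distances)
print(label_l1, distances_l1)
```

In this toy case the document's weight lies mostly on the first two terms, so the "sports" subspace leaves the smaller residual under both norms. How the theme word vectors are extracted from the topic structure is outside the scope of this sketch.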