AutoPCS: A Phrase-Based Text Categorization System for Similar Texts

Zhixu Li, Pei Li, Wei Wei, Hongyan Liu, Jun He, Tao Liu, Xiaoyong Du
DOI: https://doi.org/10.1007/978-3-642-00672-2_33
2009-01-01
Abstract: Nearly all text classification methods assign texts to predefined categories according to the terms that appear in them. State-of-the-art text classification methods tend to simply take a word as a term, since this performs well on several well-known datasets; some experts have even pointed out that phrases do not improve classification accuracy, or improve it only marginally. However, we found that this is not always true when categorizing texts about similar topics in the same domain. With words alone, we cannot categorize such texts effectively, since they share nearly the same word set. We therefore hypothesize that the results might improve if phrases are also used as terms. To test this hypothesis, we propose our own phrase extraction method and select a suitable feature selection method and classifier through an experimental study on a dataset of paper abstracts from the database field. We also develop a system called AutoPCS, which helps experts choose relevant topics for newly submitted papers from a predefined topic list using only their abstracts.
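The observation that similar-topic texts share nearly the same word set can be illustrated with a minimal sketch. It makes several assumptions: the two example texts are invented, adjacent word pairs (bigrams) stand in for phrases, and Jaccard overlap stands in for a similarity measure. None of these is the phrase extraction method, feature selection method, or classifier used in the paper.

```python
def word_set(text):
    """Return the set of lowercase words in a text (whitespace tokenization)."""
    return set(text.lower().split())


def bigram_set(text):
    """Return the set of adjacent word pairs, a crude stand-in for phrases."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical texts on related database topics, built from the same words.
doc_a = "query optimization for data stream processing"
doc_b = "data optimization for stream query processing"

word_sim = jaccard(word_set(doc_a), word_set(doc_b))
bigram_sim = jaccard(bigram_set(doc_a), bigram_set(doc_b))

print(f"word overlap:   {word_sim:.3f}")
print(f"bigram overlap: {bigram_sim:.3f}")
```

With word features the two texts are indistinguishable, while the word-pair features differ in all but one pair ("optimization for"), which is the kind of signal phrase-based terms could give a classifier.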