Multi-Level Topical Text Categorization with Wikipedia

Nan Guo, Yuan He, ChunGang Yan, Lu Liu, Cheng Wang
DOI: https://doi.org/10.1145/2996890.3007856
2016-01-01
Abstract: This paper introduces an automatic categorical-marking model for text categorization. Traditional classification algorithms generally rely on labeled training sets and require substantial manual work to tag categories beforehand. In addition, because texts are ambiguous and fuzzy, the results of traditional text categorization algorithms may be neither clear enough nor rich enough in content. This paper presents an unsupervised, training-set-free, hierarchical categorization model called Folk-Topical Text Categorization (FTTC). FTTC applies a topic model to abstract documents into topical words and makes use of Wikipedia's crowd-sourcing and collective control to extend hierarchical classifications. The results are not restricted to predefined categories; they include categories abstracted to deeper semantic levels and greatly facilitate traditional text categorization applications. For a document, its topical words are obtained using a popular topic model, Latent Dirichlet Allocation (LDA). The topical words are then used to build and traverse the category trees of Wikipedia. After filtering, the final classifications comprehensively reflect the diverse, content-rich information in the text and cover its different aspects. Experimental results on different kinds of datasets show that the model improves on traditional models in classification accuracy, flexibility and intelligibility.
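The category-tracing step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the toy `PARENTS` graph stands in for Wikipedia's category structure, the topical words are supplied by hand rather than produced by LDA, and the `min_support` filter is a hypothetical stand-in for the filtering the abstract mentions but does not specify.

```python
from collections import Counter, deque

# Hypothetical toy fragment of a category graph: node -> parent categories.
# Wikipedia's real category structure is far larger and not a strict tree.
PARENTS = {
    "neuron": ["Neuroscience"],
    "synapse": ["Neuroscience"],
    "gradient": ["Mathematical optimization"],
    "Neuroscience": ["Biology"],
    "Mathematical optimization": ["Mathematics"],
    "Biology": ["Science"],
    "Mathematics": ["Science"],
}


def ancestors(word, max_depth):
    """Return {category: depth} for categories within max_depth levels above word."""
    seen = {}
    queue = deque([(word, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not climb past the requested level
        for parent in PARENTS.get(node, []):
            if parent not in seen:
                seen[parent] = depth + 1
                queue.append((parent, depth + 1))
    return seen


def categorize(topical_words, max_depth=2, min_support=2):
    """Keep categories reached by at least min_support topical words (assumed filter)."""
    support = Counter()
    level = {}
    for word in topical_words:
        for cat, depth in ancestors(word, max_depth).items():
            support[cat] += 1
            level[cat] = min(depth, level.get(cat, depth))
    kept = [c for c, n in support.items() if n >= min_support]
    # Order from the most specific level to the most general one.
    return sorted(kept, key=lambda c: (level[c], c))
```

Calling `categorize(["neuron", "synapse", "gradient"])` on this toy graph keeps only the categories shared by at least two topical words, listed from specific to general. Raising `max_depth` lets the trace reach more abstract levels, which loosely mirrors the multi-level output the abstract describes.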