Effectively Classifying Short Texts by Structured Sparse Representation with Dictionary Filtering

Longwen Gao,Shuigeng Zhou,Jihong Guan
DOI: https://doi.org/10.1016/j.ins.2015.06.033
IF: 8.1
2015-01-01
Information Sciences
Abstract:Short text classification (STC) has attracted increasing interest recently with the rapid growth of Web and social media data existing in short text form. It is a more challenging task than traditional text classification (TC) because of the feature sparsity of the processed short texts, which makes the state of the art TC approaches perform poorly on short texts if being applied straightforwardly. Existing STC approaches deal with the sparse problem mainly by enriching text content with outer corpora or additional information. Though better performance can be obtained, the performance heavily relies on the amount and quality of outer or additional information. What is worse, such outer or additional information is not always available, not to mention the high cost for acquiring such information. In this paper, we introduce a structured sparse representation classifier to effectively classify short texts, and develop an effective approach called convex hull vertices selection to reduce data correlation and redundancy of the dictionary (the set of training texts), which thus substantially boosts STC efficiency and performance. To the best of our knowledge, this is the first work that exploits structured sparsity for STC. Experiments over five datasets show that the proposed approach outperforms the state of the art TC methods in classification effectiveness and the traditional SR classifier in both classification effectiveness and classification efficiency. Furthermore, we carry out an experiment to classify short texts expanded by additional content, which indirectly shows that our approach performs better than the existing SIC methods that exploit external text sources. (C) 2015 Elsevier Inc. All rights reserved.
What problem does this paper attempt to address?