Inter-Category Distribution Enhanced Feature Extraction for Efficient Text Classification

Yuming Wang, Jun Huang, Yun Liu, Lai Tu, Ling Liu
DOI: https://doi.org/10.1007/978-3-319-94301-5_2
2018-01-01
Abstract: Text data is one of the dominant data types in Big Data driven services and applications. The performance of text classification depends largely on the quality of feature extraction over the text corpus. For supervised learning over text documents, the TF-IDF (Term Frequency-Inverse Document Frequency) weighting factor is one of the most frequently used features in text classification. In this paper, we address two known limitations of TF-IDF based feature extraction. First, the conventional TF-IDF weighting factor does not consider the synonymous relationships between feature terms. Second, for a big corpus with a large number of text documents and feature terms, the computational complexity of text classification increases with the dimensionality of the feature space. We address these problems by introducing an optimization technique based on the Inter-Category Distributions (ICD) of terms and the Inter-Category Distributions of documents. We call this new weighting factor TF-IDF-ICD, namely TF-IDF with Inter-Category Distributions. To further enhance the effectiveness of our TF-IDF-ICD method, we describe a TF-IDF-ICD threshold based Dimensionality Reduction (DR) optimization. We test the text classifier with a corpus of 10,000 articles. The evaluation results show that the proposed TF-IDF-ICD based text classification method outperforms the conventional TF-IDF based classification solution by \(7.84\%\) while using only about \(43.19\%\) of the training time required by the conventional TF-IDF based text classification methods.
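For orientation, the conventional TF-IDF baseline that the abstract compares against, together with a generic threshold based pruning of the feature space, might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the abstract does not give the ICD formula, so no inter-category term appears here. The function names `tf_idf` and `prune_by_threshold`, the toy corpus, the threshold value, and the choice to keep a term when its maximum weight over the corpus reaches the threshold are all hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (conventional TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Weight = relative term frequency * log(N / document frequency).
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

def prune_by_threshold(weights, threshold):
    """Keep terms whose maximum weight over the corpus reaches the threshold.

    Assumption: this pruning rule only illustrates threshold based
    dimensionality reduction; the paper applies its threshold to TF-IDF-ICD.
    """
    best = {}
    for doc_weights in weights:
        for term, w in doc_weights.items():
            best[term] = max(best.get(term, 0.0), w)
    return {term for term, w in best.items() if w >= threshold}

# Toy corpus (hypothetical): three short, already tokenized documents.
docs = [
    ["stock", "market", "falls"],
    ["stock", "prices", "rise"],
    ["team", "wins", "match"],
]
weights = tf_idf(docs)
vocabulary = prune_by_threshold(weights, threshold=0.3)
```

In this toy corpus, "stock" occurs in two of the three documents, so its weight falls below the threshold and it is dropped, while terms unique to one document are kept. A smaller vocabulary means a lower-dimensional feature space, which is the effect the abstract's DR optimization targets.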