Improved Document Feature Selection with Categorical Parameter for Text Classification.
Fen Wang, Xiaoxuan Li, Xiaotao Huang, Ling Kang
DOI: https://doi.org/10.1007/978-3-319-50463-6_8
2016-01-01
Abstract: Social networks are developing rapidly, and thousands of new data items appear on the Internet every day. Classification technology is key to organizing big data. Feature Selection (FS) is a direct way to improve classification efficiency: it reduces the size of the feature subset while preserving classification accuracy, based on a score that an FS method computes for each feature. Most previous FS studies emphasized precision, while time efficiency was commonly ignored. In this study, we first propose a method named CDFDC, which combines CDF and Category-Frequency. Second, we compare DF, CDF, CHI, IG, CDFP VM and CDFDC to determine the relationships among algorithm complexity, time efficiency and classification accuracy. The experiments use the 20-news-group data set and an NB classifier. The performance of the FS methods is evaluated on seven aspects: precision, Micro F1, Macro F1, feature-selection time, document-conversion time, training time and classification time. The results show that the proposed method performs well in both efficiency and accuracy when the size of the feature subset is greater than 3,000. We also find that an FS algorithm's complexity is unrelated to accuracy, but that complexity can ensure time stability and predictability.
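The abstract does not give the formulas for CDF or CDFDC, so the sketch below does not attempt to reproduce the authors' method. It only illustrates, under stated assumptions, the two simplest statistics that such score-based FS methods build on: document frequency (DF, the number of documents containing a term) and a category frequency, assumed here to mean the number of categories in which a term occurs. The function names `term_statistics` and `top_k_by_df` and the toy corpus are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict


def term_statistics(docs):
    """Compute per-term document frequency and category frequency.

    docs: iterable of (tokens, category) pairs.
    Category frequency is ASSUMED to mean the number of distinct
    categories in which a term appears; the abstract does not define it.
    """
    df = Counter()
    categories_per_term = defaultdict(set)
    for tokens, category in docs:
        # Count each term once per document, regardless of repetitions.
        for term in set(tokens):
            df[term] += 1
            categories_per_term[term].add(category)
    cf = {term: len(cats) for term, cats in categories_per_term.items()}
    return df, cf


def top_k_by_df(df, k):
    """Keep the k terms with the highest DF (ties broken alphabetically)."""
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus, invented for illustration only.
corpus = [
    (["game", "team", "win"], "sports"),
    (["team", "score", "game"], "sports"),
    (["cpu", "memory", "game"], "computers"),
]

df, cf = term_statistics(corpus)
selected = top_k_by_df(df, 2)
```

In this toy corpus, "game" appears in every document and in both categories, so a pure DF ranking keeps it even though it does little to separate the classes. That is the kind of weakness that motivates adding a categorical statistic to the score.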