Feature Extension and Category Research for Short Text Based on Spark Platform

Wen WANG, Kankan ZHAO, Cuiping LI, Hong CHEN, Hui SUN
DOI: https://doi.org/10.3778/j.issn.1673-9418.1608041
2017-01-01
Abstract: Short text classification often suffers from high feature dimensionality, sparse features, and poor classification accuracy. Feature extension can address these problems effectively, but it greatly reduces execution efficiency. To improve both the accuracy and the efficiency of short text classification, this paper proposes an association rule based feature extension method designed for the Spark platform. Given a background data set for the short text corpus, the method first extends the original corpus and complements its features by mining association rules and their corresponding confidences. It then applies a new cascade SVM (support vector machine) algorithm that uses distance-based selection during classification. Finally, the feature extension and classification algorithms for short text are implemented on the Spark platform, so that distributed processing improves the efficiency of short text processing. The experiments show that the new method achieves a fourfold efficiency improvement over the traditional method and a 15% increase in classification accuracy, of which feature extension and classification optimization contribute 10% and 5%, respectively.
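
The feature extension step described in the abstract can be illustrated with a small single-machine sketch. This is not the authors' Spark implementation: the function names `mine_rules` and `extend_features`, the restriction to one-to-one rules, the thresholds, and the choice to use the rule confidence as the weight of an added feature are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def mine_rules(background_docs, min_support=2, min_confidence=0.6):
    """Mine one-to-one rules a -> b from a background corpus.

    confidence(a -> b) = (#docs containing a and b) / (#docs containing a).
    Restricting rules to single-term antecedents is a simplifying assumption.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for doc in background_docs:
        terms = set(doc)
        term_counts.update(terms)
        # Ordered pairs, so that a -> b and b -> a are counted separately.
        pair_counts.update(permutations(sorted(terms), 2))

    rules = {}
    for (a, b), n_ab in pair_counts.items():
        if n_ab < min_support:
            continue
        confidence = n_ab / term_counts[a]
        if confidence >= min_confidence:
            rules.setdefault(a, {})[b] = confidence
    return rules


def extend_features(doc, rules):
    """Return term -> weight for one short text.

    Original terms keep weight 1.0; each added term is weighted by the
    highest confidence of any rule that produced it (an assumed scheme).
    """
    doc_terms = set(doc)
    weights = {term: 1.0 for term in doc_terms}
    for term in doc_terms:
        for consequent, confidence in rules.get(term, {}).items():
            if consequent not in doc_terms:
                weights[consequent] = max(weights.get(consequent, 0.0), confidence)
    return weights


# Hypothetical toy background corpus, already tokenized.
background = [
    ["spark", "cluster", "rdd"],
    ["spark", "rdd"],
    ["spark", "cluster"],
    ["svm", "kernel"],
    ["svm", "kernel", "margin"],
]
rules = mine_rules(background)
extended = extend_features(["rdd", "job"], rules)
```

In this toy corpus every document containing "rdd" also contains "spark", so the short text `["rdd", "job"]` gains "spark" as an extra feature. A distributed version would compute the term and pair counts in parallel across partitions; that part is not shown here.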