Influence of Word Normalization and Chi-Squared Feature Selection on Support Vector Machine (SVM) Text Classification

Muljono, Edy Kholid Mawardi, Ardy Wibowo Haryanto
DOI: https://doi.org/10.1109/ISEMANTIC.2018.8549748
2018-09-01
Abstract: In this study, we used SVM for text classification. We applied word normalization by either stemming or lemmatization, followed by Chi-squared feature selection. The data were also pre-processed with stopword removal and tokenization. We used the BBC dataset, which contains 2,225 documents in 5 categories. Stemming produced 21,813 features and lemmatization produced 31,007 features. Each feature represents the number of times a word occurs in a document. We used a confusion matrix to evaluate the results of text classification. SVM text classification using stemming enhanced by Chi-squared (method 1) gave better results than using lemmatization enhanced by Chi-squared (method 2). The best performance was obtained with 80% feature reduction, where method 1 achieved a precision of 95%, a recall of 95%, and an accuracy of 95.05%. Method 2 achieved only a precision of 93%, a recall of 93%, and an accuracy of 93.24% with the same amount of feature reduction.
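The Chi-squared selection step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which Chi-squared formulation or library was used, so the code below assumes the common document-presence variant (a 2x2 term/class contingency table, with the maximum score over classes taken per term). The function names `chi2_scores` and `select_features` and the toy documents are hypothetical and stand in for the tokenized, stopword-filtered, stemmed or lemmatized BBC texts.

```python
from collections import Counter, defaultdict


def chi2_scores(docs, labels):
    """Score each term by its maximum chi-squared value over all classes.

    Assumption: scores use document presence (does the term occur in the
    document or not), not raw word counts.
    """
    n = len(docs)
    class_count = Counter(labels)      # documents per class
    df = Counter()                     # documents containing each term
    df_class = defaultdict(Counter)    # per class: documents containing each term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_class[label][term] += 1

    scores = {}
    for term, n_t in df.items():
        best = 0.0
        for cls, n_c in class_count.items():
            a = df_class[cls][term]    # in class, contains term
            b = n_t - a                # other classes, contains term
            c = n_c - a                # in class, lacks term
            d = n - n_t - c            # other classes, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_features(scores, reduction=0.8):
    """Keep the top (1 - reduction) fraction of terms, ranked by score."""
    keep = max(1, round(len(scores) * (1 - reduction)))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))  # ties broken by name
    return ranked[:keep]


# Hypothetical toy corpus: already tokenized and normalized.
docs = [
    ["goal", "match", "team"],
    ["match", "score", "team"],
    ["market", "shares", "team"],
    ["bank", "market", "profit"],
]
labels = ["sport", "sport", "business", "business"]

scores = chi2_scores(docs, labels)
selected = select_features(scores, reduction=0.8)
print(selected)  # ['market', 'match']
```

In this toy corpus, "match" and "market" each occur in every document of exactly one class, so they receive the highest score, while "team" occurs in both classes and is dropped. If the 80% reduction is read as a fraction of the vocabulary, it would leave roughly 4,363 of the 21,813 stemmed features and roughly 6,201 of the 31,007 lemmatized features for the SVM.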
Computer Science