Feature Extraction or Feature Selection for Text Classification: A Case Study on Phishing Email Detection

Masoumeh Zareapoor, Seeja K. R.
DOI: https://doi.org/10.5815/ijieeb.2015.02.08
2015-01-01
International Journal of Information Engineering and Electronic Business
Abstract: Dimensionality reduction is generally performed when high-dimensional data such as text are classified. It can be done using either feature extraction or feature selection techniques. This paper analyses which dimensionality reduction technique is better for classifying text data such as emails. Email classification is difficult because its high-dimensional, sparse features degrade the generalization performance of classifiers. In phishing email detection, dimensionality reduction is used to keep the most informative and discriminative features from a collection of emails, consisting of both phishing and legitimate messages, for better detection. Two feature selection techniques (Chi-Square and Information Gain Ratio) and two feature extraction techniques (Principal Component Analysis and Latent Semantic Analysis) are used for the analysis. It is found that feature extraction techniques offer better classification performance, give stable results across different numbers of chosen features, and robustly maintain their performance over time.
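To make the feature selection side concrete, the following is a minimal sketch of Chi-Square term scoring on a toy corpus. It is not the authors' implementation: the six emails, the labels, and the function names `chi_square_scores` and `select_top_k` are all invented for illustration, and terms are treated as simple present/absent indicators.

```python
from collections import Counter

# Toy corpus invented for illustration only (not data from the paper).
# Label 1 = phishing, label 0 = legitimate.
EMAILS = [
    ("verify your account password now", 1),
    ("urgent verify your bank account", 1),
    ("click here to update password", 1),
    ("meeting agenda for monday", 0),
    ("lunch on monday with the team", 0),
    ("project update and meeting notes", 0),
]


def chi_square_scores(docs):
    """Score each term by the chi-square statistic of its 2x2 presence/class table."""
    n = len(docs)
    n_pos = sum(label for _, label in docs)
    df = Counter()      # number of emails containing the term
    pos_df = Counter()  # number of phishing emails containing the term
    for text, label in docs:
        for term in set(text.split()):
            df[term] += 1
            if label == 1:
                pos_df[term] += 1

    scores = {}
    for term, total in df.items():
        a = pos_df[term]       # phishing, term present
        b = total - a          # legitimate, term present
        c = n_pos - a          # phishing, term absent
        d = n - n_pos - b      # legitimate, term absent
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A zero denominator means the term (or class) never varies,
        # so the term carries no discriminative information.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(docs, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    scores = chi_square_scores(docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Feature selection of this kind keeps a subset of the original terms unchanged. Feature extraction methods such as Principal Component Analysis and Latent Semantic Analysis instead build new features as combinations of the original terms, which is the distinction the abstract compares.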