Hybrid Top-K Feature Selection to Improve High-Dimensional Data Classification Using Naïve Bayes Algorithm

Riska Wibowo, M. Arief Soeleman, Affandy Affandy
DOI: https://doi.org/10.15294/sji.v10i2.42818
2023-04-20
Scientific Journal of Informatics
Abstract:
Purpose: Naïve Bayes is one of the most popular machine learning algorithms because it is simple, computationally efficient, and has good accuracy. It assumes that each attribute contributes independently to the classification result and ignores dependencies that may exist between attributes, which can interfere with its classification performance. It is also sensitive to large numbers of features, which can further reduce its performance. This study aims to improve the accuracy of Naïve Bayes on high-dimensional data by using a hybrid top-k feature selection method.

Methods: The proposed hybrid top-k feature selection method has seven stages: (1) prepare the dataset; (2) replace missing values with the average value of each attribute; (3) calculate attribute weights using the information gain weighting method; (4) select attributes using top-k feature selection; (5) apply backward elimination with the Naïve Bayes algorithm; (6) validate the dataset with the newly selected attributes using 10-fold cross-validation, in which the data is divided into training data and testing data; (7) calculate the accuracy from the confusion matrix.

Result: The experiments compare the performance of four methods: Naïve Bayes, deep feature weighting Naïve Bayes, top-k feature selection, and hybrid top-k feature selection. On the 5 datasets from the UCI Repository that were tested, the hybrid top-k feature selection method achieves higher accuracy than the previous methods and ranks first in the accuracy comparison.

Novelty: The hybrid top-k feature selection method can be used to handle high-dimensional data in the Naïve Bayes algorithm.
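
The feature selection stages described in the Methods (information gain weighting, top-k selection, backward elimination with Naïve Bayes, and cross-validated accuracy) can be illustrated with a minimal sketch. It is not the authors' implementation: every function name here is hypothetical, features are assumed to be discrete, missing-value imputation is skipped, folds are assigned by row index, and the greedy elimination rule (drop a feature whenever doing so does not lower accuracy) is one simple variant chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """Label entropy minus label entropy conditioned on one discrete feature."""
    groups = defaultdict(list)
    for value, label in zip(column, labels):
        groups[value].append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k(X, y, k):
    """Indices of the k features with the highest information gain."""
    gains = [information_gain([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)[:k]


def nb_predict(train_X, train_y, row, features):
    """Categorical naive Bayes with add-one smoothing, restricted to `features`."""
    best_label, best_score = None, -math.inf
    for label, count in Counter(train_y).items():
        score = math.log(count / len(train_y))  # log prior
        for j in features:
            n_values = len({r[j] for r in train_X})
            matches = sum(1 for r, t in zip(train_X, train_y)
                          if t == label and r[j] == row[j])
            score += math.log((matches + 1) / (count + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cv_accuracy(X, y, features, folds=10):
    """Accuracy under k-fold cross-validation (fold f tests rows with i % folds == f)."""
    correct = 0
    for f in range(folds):
        train = [i for i in range(len(X)) if i % folds != f]
        test = [i for i in range(len(X)) if i % folds == f]
        train_X, train_y = [X[i] for i in train], [y[i] for i in train]
        correct += sum(nb_predict(train_X, train_y, X[i], features) == y[i]
                       for i in test)
    return correct / len(X)


def backward_elimination(X, y, features, folds=10):
    """Greedily drop features while removal does not lower cross-validated accuracy."""
    selected = list(features)
    best = cv_accuracy(X, y, selected, folds)
    changed = True
    while changed and len(selected) > 1:
        changed = False
        for j in list(selected):
            trial = [f for f in selected if f != j]
            acc = cv_accuracy(X, y, trial, folds)
            if acc >= best:
                best, selected, changed = acc, trial, True
                break
    return selected, best


def hybrid_top_k(X, y, k, folds=10):
    """Top-k filter by information gain, then backward elimination as a wrapper."""
    return backward_elimination(X, y, top_k(X, y, k), folds)


# Toy data (made up): feature 0 copies the label, feature 1 is noise, feature 2 is constant.
toy_y = [0] * 10 + [1] * 10
toy_X = [[toy_y[i], i % 3, 0] for i in range(20)]
selected_features, accuracy = hybrid_top_k(toy_X, toy_y, k=2)
```

On this toy data the filter stage keeps the two most informative features and the wrapper stage then discards the noisy one, which mirrors the two-stage idea of combining a cheap ranking step with a classifier-driven elimination step.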