Semi-Supervised Feature Selection of Educational Data Mining for Student Performance Analysis

Shanshan Yu, Yiran Cai, Baicheng Pan, Man-Fai Leung
DOI: https://doi.org/10.3390/electronics13030659
IF: 2.9
2024-02-06
Electronics
Abstract: In recent years, the informatization of the educational system has caused a substantial increase in educational data. Educational data mining can assist in identifying the factors influencing students' performance. However, two challenges have arisen in the field of educational data mining: (1) How to handle the abundance of unlabeled data? (2) How to identify the most crucial characteristics that impact student performance? In this paper, a semi-supervised feature selection framework is proposed to analyze the factors influencing student performance. The proposed method is semi-supervised, enabling the processing of a considerable amount of unlabeled data with only a few labeled instances. Additionally, by solving a feature selection matrix, the weights of each feature can be determined, to rank their importance. Furthermore, various commonly used classifiers are employed to assess the performance of the proposed feature selection method. Extensive experiments demonstrate the superiority of the proposed semi-supervised feature selection approach. The experiments indicate that behavioral characteristics are significant for student performance, and the proposed method outperforms the state-of-the-art feature selection methods by approximately 3.9% when extracting the most important feature.
Engineering, Electrical & Electronic; Computer Science, Information Systems; Physics, Applied
What problem does this paper attempt to address?
The paper addresses two key issues in Educational Data Mining (EDM):

1. **How to handle a large amount of unlabeled data?** Educational data usually includes a large amount of unlabeled data (such as records of students' classroom performance and discussions). This data is rich in information but lacks corresponding labels. The proposed method can make use of a large amount of unlabeled data with only a small amount of labeled data.
2. **How to identify key features that affect student performance?** The paper points out that educational datasets often contain many irrelevant features, which may reduce model accuracy. Determining which features are crucial to students' academic performance is therefore an important task.

To address these issues, the paper proposes a method called SFSGLR (Semi-Supervised Feature Selection based on Generalized Linear Regression). The method is semi-supervised, so it can process a large amount of unlabeled data given only a few labeled instances. By solving for a feature selection matrix, it obtains a weight for each feature, and these weights are used to rank the features by importance. The paper then uses several commonly used classifiers to evaluate how well the selected features perform.

The experiments show that the proposed method outperforms existing state-of-the-art feature selection methods by approximately 3.9% when extracting the most important feature. The study also finds that behavioral features are particularly important for student performance, which gives educators and policymakers a basis for developing targeted teaching strategies and interventions.
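The idea of ranking features by the weights in a learned feature selection matrix can be illustrated with a small sketch. This is not the paper's SFSGLR objective: as a stand-in it assumes plain ridge regression with pseudo-labeling of the unlabeled rows, and it scores each feature by the L2 norm of its row in the weight matrix `W`. The function name `rank_features`, the parameters `reg` and `rounds`, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def rank_features(X_lab, Y_lab_onehot, X_unlab, reg=1.0, rounds=3):
    """Rank features by row norms of a ridge weight matrix W (d x c).

    Illustrative stand-in only, NOT the SFSGLR method from the paper:
    fit on labeled rows, pseudo-label the unlabeled rows, refit on all rows.
    """
    d = X_lab.shape[1]
    c = Y_lab_onehot.shape[1]

    # Standardize every feature using labeled and unlabeled rows together.
    X_all = np.vstack([X_lab, X_unlab])
    mu, sd = X_all.mean(axis=0), X_all.std(axis=0)
    sd[sd == 0] = 1.0                      # guard against constant columns
    X_all = (X_all - mu) / sd
    Xl, Xu = X_all[: len(X_lab)], X_all[len(X_lab):]

    def ridge(X, Y):
        # Closed-form ridge solution: W = (X^T X + reg * I)^{-1} X^T Y
        return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y)

    W = ridge(Xl, Y_lab_onehot)
    for _ in range(rounds):
        # Hard pseudo-labels for the unlabeled rows from the current model.
        pseudo = np.eye(c)[np.argmax(Xu @ W, axis=1)]
        W = ridge(X_all, np.vstack([Y_lab_onehot, pseudo]))

    scores = np.linalg.norm(W, axis=1)     # one importance score per feature
    order = np.argsort(-scores)            # most important feature first
    return order, scores

# Synthetic data (an assumption for the demo): only feature 2 carries signal.
rng = np.random.default_rng(0)
n_lab, n_unlab, d = 20, 200, 6

def make_X(y):
    X = rng.normal(size=(y.size, d))
    X[:, 2] += 3.0 * (2 * y - 1)           # shift feature 2 by class
    return X

y_lab = np.arange(n_lab) % 2               # few labeled instances
y_unlab = rng.integers(0, 2, size=n_unlab) # labels hidden from the selector
X_lab, X_unlab = make_X(y_lab), make_X(y_unlab)

order, scores = rank_features(X_lab, np.eye(2)[y_lab], X_unlab)
top_k = order[:1]                          # keep only the top-ranked feature
X_lab_selected = X_lab[:, top_k]           # input for any downstream classifier
```

A downstream classifier would then be trained on `X_lab_selected` (or on the top few columns in `order`) and its accuracy compared across feature selection methods, which mirrors the evaluation protocol the summary describes.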