Abstract: In recent years, interest in cognitive diagnostic assessments (CDAs), a new form of testing, has increased drastically. Because of the specific design of these tests, missing data is an inevitable problem in CDAs. Handling missing data properly is important for providing accurate diagnostic feedback to students and teachers.

With the use of machine learning in education, relevant advances have been made in missing data imputation. Research has shown that machine learning techniques have more desirable features for missing data imputation than traditional approaches. The random forest algorithm has been extended into the random forest imputation (RFI) method for handling missing data in CDAs. The method takes the characteristics of the data into account rather than assuming a particular missing data mechanism. RFI is a new non-parametric method that makes full use of the available response information and the characteristics of response patterns to impute missing data.

Building on the strengths of RFI in classification and prediction and on its independence from the type of missing data mechanism, we improved it and proposed the new random forest threshold imputation (RFTI) method, which can be used to impute missing responses in the widely used DINA (Deterministic Inputs, Noisy “And” Gate) model. This research proposed applying the Response Conformity Index (RCI) in missing data imputation to set the imputation threshold, thereby developing a treatment for missing responses in CDAs that does not rely entirely on imputation.

Two simulation studies were conducted to compare the performance of the proposed method with that of traditional methods. Study 1 began by introducing the theoretical background and algorithmic implementation of RFTI. RFTI and RFI were then compared in terms of imputation accuracy for data with different proportions of missingness (10%, 20%, 30%, 40%, 50%) and different missing data mechanisms (MIXED, MNAR, MAR, MCAR), in order to affirm the necessity of including the RCI during imputation. Study 2 investigated the performance of RFTI, RFI, and the EM algorithm in imputing missing data under different conditions. The manipulated design factors were identical to those in Study 1. We evaluated RFTI in terms of its accuracy in assessing the model attributes and item parameters, and compared it against the traditionally better-performing EM algorithm and RFI under various design conditions to explore the advantages of RFTI and the conditions under which it should be used.

The results of Study 1 showed that RFTI, compared with RFI, improved imputation accuracy when the imputation threshold was one. Across the design conditions, the imputation rate and accuracy of RFTI were also better. Study 2 showed that RFTI outperformed the other methods (RFI and the EM algorithm) in accurately assessing the attribute pattern and attribute margin. This advantage was affected by the missing data mechanism and the proportion of missing data. Notably, RFTI was markedly better than the other methods in handling mixed-type or MNAR missing data, and when the proportion of missing data was higher than 30%. However, RFTI was no better than the other methods in the accuracy of its item parameter estimates. In most conditions, the EM algorithm provided the most accurate parameter estimates.

In sum, we propose a method for imputing missing data in CDAs by applying machine learning methods to measurement models. The advantage of this new method is affirmed by its accurate assessment of the attribute pattern and attribute margin of the DINA model. Theoretically, the current study provides a missing data imputation approach with fewer assumptions, which extends the traditional methods for imputing missing data within the CDA framework. Moreover, we investigate how to estimate students' attribute patterns accurately from their responses to only a few items, which sheds light on imputing data that are missing because of particular designs in assessment or teaching.
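As a rough illustration of two ideas mentioned in the abstract, the sketch below shows (a) the standard DINA response probability, in which an examinee answers correctly with probability 1 - slip when all attributes required by the item are mastered and with probability guess otherwise, and (b) a generic confidence-threshold rule that imputes a missing response only when a prediction is confident enough. The abstract does not define the RCI, so the second function is an assumption: it uses a hypothetical predicted probability of a correct response (for example, one produced by a random forest) as a stand-in and is not the authors' RFTI procedure. All function and parameter names are illustrative.

```python
from typing import Optional, Sequence


def dina_prob(alpha: Sequence[int], q_row: Sequence[int],
              slip: float, guess: float) -> float:
    """Probability of a correct response under the DINA model.

    alpha: the examinee's attribute mastery pattern (0/1 per attribute).
    q_row: the item's row of the Q-matrix (1 = attribute required).
    """
    # eta is True only if every attribute the item requires is mastered.
    eta = all(a == 1 for a, q in zip(alpha, q_row) if q == 1)
    return 1.0 - slip if eta else guess


def threshold_impute(p_correct: float, threshold: float) -> Optional[int]:
    """Impute 0 or 1 only when the prediction clears the threshold.

    ASSUMPTION: p_correct is a hypothetical predicted probability of a
    correct response, used here in place of the paper's RCI, which the
    abstract does not define. A threshold above 0.5 is assumed.
    """
    if p_correct >= threshold:
        return 1
    if 1.0 - p_correct >= threshold:
        return 0
    return None  # not confident enough: leave the response missing


if __name__ == "__main__":
    # Item requires attributes 1 and 2; the examinee has mastered both.
    print(dina_prob([1, 1, 0], [1, 1, 0], slip=0.1, guess=0.2))
    # Examinee lacks attribute 2, so only guessing can yield a correct answer.
    print(dina_prob([1, 0, 0], [1, 1, 0], slip=0.1, guess=0.2))
    # A confident prediction is imputed; an uncertain one stays missing.
    print(threshold_impute(0.95, threshold=0.9))
    print(threshold_impute(0.60, threshold=0.9))
```

Leaving low-confidence cells missing, rather than filling every cell, mirrors the abstract's stated aim of treating missing responses without relying entirely on imputation. How the threshold is actually derived from the RCI is left to the paper itself.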

Imputation and low-rank estimation with Missing Not At Random data

Model-based Clustering with Missing Not At Random Data

Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption

Matrix Completion for Survey Data Prediction with Multivariate Missingness

Matrix Completion under Low-Rank Missing Mechanism

Post-surgical Complication Prediction in the Presence of Low-Rank Missing Data.

Deep Generative Imputation Model for Missing Not At Random Data

Information-decomposition-model-based Missing Value Estimation for Not Missing at Random Dataset

A novel low-rank matrix completion approach to estimate missing entries in Euclidean distance matrices

Semiparametric Estimation with Data Missing Not at Random Using an Instrumental Variable.

Sparse Data Reconstruction, Missing Value and Multiple Imputation through Matrix Factorization

Imputation of data Missing Not at Random: Artificial generation and benchmark analysis

Estimation beyond Missing (Completely) at Random

Matrix Completion with Model-free Weighting

Handling Nonmonotone Missing Data with Available Complete-Case Missing Value Assumption

Missing at Random or Not: A Semiparametric Testing Approach

Missing Data Analysis in Cognitive Diagnostic Models: Random Forest Threshold Imputation Method

The Analysis of Social-Science Data with Missing Values

Majorized Proximal Alternating Imputation for regularized rank constrained matrix completion

Low-rank matrix estimation via nonconvex optimization methods in multi-response errors-in-variables regression