Abstract: We propose a multiple imputation method to deal with incomplete categorical data. This method imputes the missing entries using the principal components method dedicated to categorical data: multiple correspondence analysis (MCA). The uncertainty concerning the parameters of the imputation model is reflected using a non-parametric bootstrap. Multiple imputation using MCA (MIMCA) requires estimating only a small number of parameters, owing to the dimensionality reduction property of MCA. This allows the user to impute a wide range of data sets. In particular, a high number of categories per variable, a high number of variables, or a small number of individuals is not an issue for MIMCA. Through a simulation study based on real data sets, the method is assessed and compared to the reference methods (multiple imputation using the loglinear model, multiple imputation by logistic regressions) as well as to the latest works on the topic (multiple imputation by random forests or by the Dirichlet process mixture of products of multinomial distributions model). The proposed method shows good performance in terms of bias and coverage for an analysis model such as a main effects logistic regression model. In addition, MIMCA has the great advantage of being substantially less time-consuming on high-dimensional data sets than the other multiple imputation methods.

What problem does this paper attempt to address?

The paper addresses how to handle missing values in incomplete categorical data. Specifically, the authors propose a multiple imputation method based on multiple correspondence analysis (MCA), called multiple imputation using MCA (MIMCA), to impute the missing entries of categorical variables. The method exploits the dimensionality reduction performed by MCA: only a small number of parameters has to be estimated, so imputation remains feasible on high-dimensional data sets.

### Core Problems of the Paper

1. **Missing-value treatment**
   - Handle missing values in data sets made up of categorical variables.
   - Remain applicable when the number of variables is large, the number of categories per variable is large, or the number of individuals is small.
2. **Performance of the imputation method**
   - Evaluate MIMCA through a simulation study based on real data sets, and compare it with the reference methods (multiple imputation using the loglinear model, multiple imputation by logistic regressions) and with more recent methods (multiple imputation by random forests, multiple imputation by the Dirichlet process mixture of products of multinomial distributions model).
   - The comparison criteria are bias and coverage, for an analysis model such as a main effects logistic regression model.
3. **Computational efficiency**
   - On high-dimensional data sets, MIMCA is substantially less time-consuming than the other multiple imputation methods.

### Method Overview

- **Multiple correspondence analysis (MCA)**: MCA is the principal components method dedicated to categorical data. It reduces the dimension of the data and therefore the number of parameters to be estimated.
- **Non-parametric bootstrap**: The uncertainty concerning the parameters of the imputation model is reflected using a non-parametric bootstrap.
- **Multiple imputation (MI)**: Several imputed data sets are generated to reflect the uncertainty about the missing values, and the analysis results obtained on them are combined according to Rubin's rules.

### Key Formulas

- **Dimensionality-reduction representation of MCA**

  \[ X \approx \hat{X} = U \Lambda V^T \]

  where \( X \) is the data matrix being decomposed (in MCA, a matrix built from the indicator coding of the categorical variables), the columns of \( U \) and \( V \) are the left and right singular vectors respectively, and \( \Lambda \) is the diagonal matrix of singular values. Keeping only a small number of dimensions gives the low-rank approximation \( \hat{X} \).
- **Non-parametric bootstrap**
  - Draw individuals with replacement from the original data set to generate several bootstrap samples.
  - Perform MCA on each bootstrap sample to obtain one set of imputation parameters per sample.
- **Multiple imputation**
  - Use the parameter set from each bootstrap sample to impute the original data set, which yields several imputed data sets.
  - Fit the analysis model on each imputed data set and combine the results according to Rubin's rules to obtain the final parameter estimate and its variance.

### Conclusion

In the simulation study, MIMCA shows good performance in terms of bias and coverage for an analysis model such as a main effects logistic regression model. It is also substantially less time-consuming on high-dimensional data sets than the other multiple imputation methods it is compared with, which makes it a practical option for categorical data sets with many variables, many categories per variable, or few individuals.
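The pooling step mentioned above (combining results "according to Rubin's rules") can be illustrated with a minimal sketch. It assumes a single scalar parameter; the function name `pool_rubin` and the numbers in the usage example are hypothetical and do not come from the paper. Only the standard pooling formulas are used: the pooled estimate is the mean of the per-data-set estimates, and the total variance is the within-imputation variance plus (1 + 1/M) times the between-imputation variance.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M point estimates and their variances with Rubin's rules.

    estimates: one estimate of the same scalar parameter per imputed data set
    variances: the squared standard error attached to each estimate
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and matching variances")

    pooled = mean(estimates)        # average of the M estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations (M - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical values: the same coefficient estimated on three imputed data sets.
estimates = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]
pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var)
```

The between-imputation term is what carries the uncertainty due to the missing values: if every imputed data set gave the same estimate, the total variance would reduce to the within-imputation variance.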

MIMCA: Multiple imputation for categorical variables with multiple correspondence analysis
