MIMCA: Multiple imputation for categorical variables with multiple correspondence analysis

Vincent Audigier, François Husson, Julie Josse
DOI: https://doi.org/10.48550/arXiv.1505.08116
2015-05-30
Abstract: We propose a multiple imputation method to deal with incomplete categorical data. This method imputes the missing entries using the principal components method dedicated to categorical data: multiple correspondence analysis (MCA). The uncertainty concerning the parameters of the imputation model is reflected using a non-parametric bootstrap. Multiple imputation using MCA (MIMCA) requires estimating a small number of parameters due to the dimensionality reduction property of MCA. It allows the user to impute a large range of data sets. In particular, a high number of categories per variable, a high number of variables or a small number of individuals are not an issue for MIMCA. Through a simulation study based on real data sets, the method is assessed and compared to the reference methods (multiple imputation using the loglinear model, multiple imputation by logistic regressions) as well as to the latest works on the topic (multiple imputation by random forests or by the Dirichlet process mixture of products of multinomial distributions model). The proposed method shows good performance in terms of bias and coverage for an analysis model such as a main effects logistic regression model. In addition, MIMCA has the great advantage that it is substantially less time consuming on data sets of high dimensions than the other multiple imputation methods.
Methodology
What problem does this paper attempt to address?
This paper addresses how to handle missing values in incomplete categorical data. Specifically, the authors propose a multiple imputation method based on Multiple Correspondence Analysis (MCA), namely Multiple Imputation using MCA (MIMCA), to fill in the missing values in categorical variables. The method exploits the dimensionality reduction property of MCA, which keeps the number of parameters to estimate small and makes imputation efficient even on high-dimensional data sets while preserving the structure of the data.

### Core Problems of the Paper

1. **Missing-value Treatment**
   - Handle missing values in data sets containing categorical variables.
   - The proposed method can handle high-dimensional data sets, that is, cases with a large number of variables, a large number of categories per variable, or a small number of individuals.
2. **Performance of the Imputation Method**
   - Evaluate MIMCA through simulation studies based on real data sets and compare it with the reference methods (multiple imputation using the log-linear model, multiple imputation by logistic regressions) and with more recent methods (multiple imputation by random forests, multiple imputation by the Dirichlet process mixture of products of multinomial distributions model).
   - Comparison criteria are bias and coverage, in particular for an analysis model such as a main effects logistic regression model.
3. **Computational Efficiency**
   - On high-dimensional data sets, MIMCA is substantially less time consuming than the other multiple imputation methods.

### Method Overview

- **Multiple Correspondence Analysis (MCA)**: MCA is the principal components method dedicated to categorical data. It reduces the dimension of the data and thereby the number of parameters to be estimated.
- **Non-parametric Bootstrap**: Reflects the uncertainty in the parameters of the imputation model through a non-parametric bootstrap.
- **Multiple Imputation (MI)**: Generates several imputed data sets to reflect the uncertainty about the missing values. The analysis results from these data sets are then combined according to Rubin's rules.

### Key Formulas

- **Dimensionality-reduction Representation of MCA**

  \[ X \approx \hat{X} = U \Lambda V^T \]

  where \( X \) is the data matrix (coded for MCA as an indicator matrix of the categories), \( U \) and \( V \) contain the left and right singular vectors respectively, and \( \Lambda \) is the diagonal matrix of singular values, with only a small number of dimensions retained.
- **Non-parametric Bootstrap**
  - Draw samples with replacement from the original data set to generate multiple bootstrap samples.
  - Perform MCA on each bootstrap sample to obtain one set of imputation model parameters per sample.
- **Multiple Imputation**
  - Use the parameter set from each bootstrap sample to impute the original data set, generating multiple imputed data sets.
  - Combine the results from the imputed data sets according to Rubin's rules to obtain the final parameter estimate and its variance.

### Conclusion

MIMCA performs well on high-dimensional categorical data sets with missing values. In the simulation study it shows good performance in terms of bias and coverage, and it is substantially less time consuming than the other multiple imputation methods on high-dimensional data sets. This makes MIMCA a practical tool for imputing complex categorical data sets.
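
The pooling step mentioned above, combining results across imputed data sets with Rubin's rules, can be illustrated with a minimal Python sketch for a single scalar parameter, for example one logistic regression coefficient. The function name `pool_rubin` and the numbers in the usage example are hypothetical illustrations and are not taken from the paper. The sketch covers only the pooled estimate and its total variance, not the associated degrees of freedom.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M estimates of one scalar parameter with Rubin's rules.

    estimates: point estimates, one per imputed data set
    variances: squared standard errors of those estimates
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = mean(estimates)      # pooled point estimate
    u_bar = mean(variances)      # within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total


# Hypothetical estimates of one coefficient from M = 5 imputed data sets.
estimates = [0.42, 0.47, 0.39, 0.45, 0.44]
variances = [0.010, 0.012, 0.011, 0.009, 0.010]
q_bar, total_var = pool_rubin(estimates, variances)
print(f"pooled estimate = {q_bar:.3f}, total variance = {total_var:.4f}")
```

The between-imputation term is what carries the uncertainty due to the missing values: if all imputed data sets gave the same estimate, the total variance would reduce to the average within-imputation variance.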