Deep Generative Imputation Model for Missing Not At Random Data

Jialei Chen, Yuanbo Xu, Pengyang Wang, Yongjian Yang
DOI: https://doi.org/10.1145/3583780.3614835
2023-08-16
Abstract: Data analysis usually suffers from the Missing Not At Random (MNAR) problem, where the cause of the value missing is not fully observed. Compared to the naive Missing Completely At Random (MCAR) problem, it is more in line with the realistic scenario whereas more complex and challenging. Existing statistical methods model the MNAR mechanism by different decomposition of the joint distribution of the complete data and the missing mask. But we empirically find that directly incorporating these statistical methods into deep generative models is sub-optimal. Specifically, it would neglect the confidence of the reconstructed mask during the MNAR imputation process, which leads to insufficient information extraction and less-guaranteed imputation quality. In this paper, we revisit the MNAR problem from a novel perspective that the complete data and missing mask are two modalities of incomplete data on an equal footing. Along with this line, we put forward a generative-model-specific joint probability decomposition method, conjunction model, to represent the distributions of two modalities in parallel and extract sufficient information from both complete data and missing mask. Taking a step further, we exploit a deep generative imputation model, namely GNR, to process the real-world missing mechanism in the latent space and concurrently impute the incomplete data and reconstruct the missing mask. The experimental results show that our GNR surpasses state-of-the-art MNAR baselines with significant margins (averagely improved from 9.9% to 18.8% in RMSE) and always gives a better mask reconstruction accuracy which makes the imputation more principle.
Machine Learning
What problem does this paper attempt to address?
The paper addresses imputation for data that are Missing Not At Random (MNAR). In many real-world scenarios, missingness is not completely random but depends on values that are themselves unobserved, which makes traditional missing-data methods (such as those built on the MCAR and MAR assumptions) less effective.

### Specific problem description:

1. **Complexity of the MNAR mechanism**:
   - Compared with MCAR (Missing Completely At Random) and MAR (Missing At Random), the MNAR mechanism is more complex and closer to what happens in practice. For example, users tend not to rate products they are not interested in, and people with financial difficulties are more likely to decline income-related survey questions.
2. **Limitations of existing methods**:
   - Existing statistical methods model the MNAR mechanism by decomposing the joint distribution of the complete data and the missing mask, but these decompositions are sub-optimal when incorporated directly into deep generative models. In particular, they neglect the confidence of the reconstructed mask during MNAR imputation, which leads to insufficient information extraction and lower imputation quality.
3. **Information bottleneck problem**:
   - Existing methods usually adopt a serial structure: they first generate the complete data and then map it to the missing mask. This structure creates an information bottleneck, especially for high-dimensional data, making it hard to distinguish observed entries from missing ones and thus reducing the accuracy of mask reconstruction.

### Solutions proposed in the paper:

To solve the above problems, the paper proposes a new generative model framework, **GNR (Generative Network for Reconstruction)**, which includes the following aspects:

1. **Conjunction Model with Parallel Structure**:
   - The complete data and the missing mask are treated as two modalities on an equal footing, and an auxiliary variable \(u\) is introduced to model their joint distribution simultaneously. The specific formula is:
     \[ p_{\theta, \phi}(x, m, u) = p_{\theta}(u)\, p_{\phi_1}(x \mid u)\, p_{\phi_2}(m \mid u) \]
   - This parallel structure avoids the information bottleneck and can extract information from both the data space and the mask space.
2. **Deep Generative Imputation Model**:
   - The GNR model uses the Variational Auto-Encoder (VAE) framework to handle the real-world missing mechanism in the latent space, imputing the missing data and reconstructing the missing mask at the same time. Experimental results show that GNR outperforms existing MNAR baselines, with average RMSE improvements ranging from 9.9% to 18.8%, and that it consistently achieves better mask reconstruction accuracy.

### Summary:

This paper targets missing-data imputation under the MNAR mechanism. It proposes the conjunction model, based on a parallel structure, and the deep generative imputation model GNR, which together improve both the quality of missing-data imputation and the accuracy of mask reconstruction.
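
As a rough illustration of the parallel factorization \(p_{\theta}(u)\, p_{\phi_1}(x \mid u)\, p_{\phi_2}(m \mid u)\) described above, the following NumPy sketch draws samples by ancestral sampling and evaluates the log-joint as a sum of three terms. It is not the paper's implementation: the linear "decoders", the Gaussian and Bernoulli likelihoods, the noise scale, and the names `init_params`, `sample`, and `log_joint` are all assumptions made for this toy example, whereas GNR itself is described as a VAE-based model.

```python
import numpy as np

def init_params(d_latent=2, d_data=3, seed=0):
    """Random, untrained toy decoder weights (hypothetical, not from the paper)."""
    rng = np.random.default_rng(seed)
    return {
        "w_x": rng.normal(size=(d_latent, d_data)),  # maps u to the mean of x
        "w_m": rng.normal(size=(d_latent, d_data)),  # maps u to mask logits
        "sigma": 0.1,                                # assumed noise scale of p(x | u)
    }

def sample(params, n, seed=1):
    """Ancestral sampling: u ~ p(u), then x ~ p(x | u) and m ~ p(m | u) in parallel."""
    rng = np.random.default_rng(seed)
    d_latent, d_data = params["w_x"].shape
    u = rng.normal(size=(n, d_latent))                       # p(u) = N(0, I)
    x = u @ params["w_x"] + params["sigma"] * rng.normal(size=(n, d_data))
    p_obs = 1.0 / (1.0 + np.exp(-(u @ params["w_m"])))       # Bernoulli parameters
    m = (rng.random((n, d_data)) < p_obs).astype(int)        # 1 = observed, 0 = missing
    return u, x, m

def gaussian_logpdf(z, mean, std):
    """Elementwise log-density of a univariate Gaussian."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((z - mean) / std) ** 2

def log_joint(params, u, x, m):
    """log p(x, m, u) = log p(u) + log p(x | u) + log p(m | u), one value per row."""
    log_p_u = gaussian_logpdf(u, 0.0, 1.0).sum(axis=1)
    log_p_x = gaussian_logpdf(x, u @ params["w_x"], params["sigma"]).sum(axis=1)
    logits = u @ params["w_m"]
    # Bernoulli log-likelihood in logit form: m * logit - log(1 + exp(logit))
    log_p_m = (m * logits - np.logaddexp(0.0, logits)).sum(axis=1)
    return log_p_u + log_p_x + log_p_m

if __name__ == "__main__":
    params = init_params()
    u, x, m = sample(params, n=5)
    x_incomplete = np.where(m == 1, x, np.nan)  # what an analyst would actually see
    print(x_incomplete)
    print(log_joint(params, u, x, m))
```

In this sketch both `x` and `m` are generated from the same `u`, so the mask is statistically related to the underlying values even though neither is computed from the other, which is the sense in which the two modalities are modeled in parallel rather than serially.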