The More Data, the Better? Demystifying Deletion-Based Methods in Linear Regression with Missing Data

Tianchen Xu, Kun Chen, Gen Li
DOI: https://doi.org/10.4310/21-sii717
2020-10-26
Abstract: We compare two deletion-based methods for dealing with the problem of missing observations in linear regression analysis. One is the complete-case analysis (CC, or listwise deletion) that discards all incomplete observations and only uses common samples for ordinary least-squares estimation. The other is the available-case analysis (AC, or pairwise deletion) that utilizes all available data to estimate the covariance matrices and applies these matrices to construct the normal equation. We show that the estimates from both methods are asymptotically unbiased and further compare their asymptotic variances in some typical situations. Surprisingly, using more data (i.e., AC) does not necessarily lead to better asymptotic efficiency in many scenarios. Missing patterns, covariance structure and true regression coefficient values all play a role in determining which is better. We further conduct simulation studies to corroborate the findings and demystify what has been missed or misinterpreted in the literature. Some detailed proofs and simulation results are available in the online supplemental materials.
Applications, Methodology
What problem does this paper attempt to address?
This paper compares two deletion-based methods for handling missing data in linear regression analysis: complete-case analysis (CC, also known as listwise deletion) and available-case analysis (AC, also known as pairwise deletion). The authors show that the estimates from both methods are asymptotically unbiased, and that AC, although it uses all available data, does not necessarily achieve better asymptotic efficiency than CC, which uses less data. Through theoretical analysis and simulation studies, the paper finds that which method performs better depends on the missing pattern, the covariance structure, and the true regression coefficient values. The paper also reviews existing results on both methods, discusses other approaches to handling missing data (imputation, weighting, and maximum likelihood), and closes with directions for further research.
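
To make the distinction between the two estimators concrete, the following is a minimal Python sketch, not the authors' code. It assumes NumPy, missing values coded as NaN, and centered variables so that only slope coefficients are estimated. The function names (`cc_estimate`, `pairwise_cov`, `ac_estimate`) are hypothetical, and the AC variant shown (pairwise means and a divisor of n for each covariance entry) is one of several conventions in use; the paper may adopt a different one.

```python
import numpy as np

def cc_estimate(X, y):
    """Complete-case (listwise deletion): OLS on rows with no missing value."""
    keep = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    Xc = X[keep] - X[keep].mean(axis=0)   # center predictors
    yc = y[keep] - y[keep].mean()         # center response
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    return beta

def pairwise_cov(a, b):
    """Covariance of a and b from the rows where both are observed.
    Assumption: pairwise means and divisor n (not n - 1)."""
    ok = ~np.isnan(a) & ~np.isnan(b)
    a, b = a[ok], b[ok]
    return np.mean((a - a.mean()) * (b - b.mean()))

def ac_estimate(X, y):
    """Available-case (pairwise deletion): estimate each covariance entry
    from all rows available for that pair, then solve the normal equation."""
    p = X.shape[1]
    Sxx = np.array([[pairwise_cov(X[:, j], X[:, k]) for k in range(p)]
                    for j in range(p)])
    sxy = np.array([pairwise_cov(X[:, j], y) for j in range(p)])
    return np.linalg.solve(Sxx, sxy)

# Illustrative data (assumed setup): two correlated predictors, with the
# second predictor missing completely at random in about 30% of rows.
rng = np.random.default_rng(0)
n = 2000
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
y = X @ np.array([1.0, -2.0]) + rng.normal(size=n)
X_miss = X.copy()
X_miss[rng.random(n) < 0.3, 1] = np.nan

print("CC:", cc_estimate(X_miss, y))
print("AC:", ac_estimate(X_miss, y))
```

Repeating the simulation over many random draws and comparing the empirical variances of the two estimates would be one way to probe the efficiency question the paper studies. Note that the pairwise covariance matrix in AC is not guaranteed to be positive definite, so `np.linalg.solve` can be ill-conditioned in small samples.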