Abstract: Missing values or incomplete data are commonly encountered in clinical research and have been studied by many authors. The causes of missing values in a study can be classified into two categories. The first category includes reasons that are not directly related to the study. For example, a patient may be lost to follow-up because he/she moves out of the area. Missing values in this category can be considered missing completely at random. The second category includes reasons that are related to the study. For example, a patient may withdraw from the study due to treatment-emergent adverse events.

In practice, it is not uncommon to have multiple assessments from each subject. Subjects with all observations missing are called unit nonrespondents. Because unit nonrespondents do not provide any useful information, these subjects are usually excluded from the analysis. Subjects with some, but not all, observations missing are referred to as item nonrespondents. Excluding item nonrespondents from the analysis is considered to be against the intent-to-treat (ITT) principle and, hence, not acceptable. In clinical research, the primary analysis is usually conducted on the ITT population, which includes all randomized subjects with at least one posttreatment evaluation. As a result, most item nonrespondents may be included in the ITT population. In addition, excluding item nonrespondents may seriously decrease the power/efficiency of the study.

To account for item nonrespondents, two methods are commonly considered. The first is the so-called likelihood-based method. Under a parametric model, the marginal likelihood function for the observed responses is obtained by integrating out the missing responses. The parameter of interest can then be estimated by the maximum likelihood estimator (MLE), and a corresponding test (e.g., the likelihood ratio test) can be constructed. The merit of this method is that the resulting statistical procedures are usually efficient. The drawback is that the calculation of the marginal likelihood can be difficult. As a result, special statistical or numerical algorithms are commonly applied to obtain the MLE. For example, the expectation–maximization (EM) algorithm is one of the most popular methods for obtaining the MLE when there are missing data.

The other method for item nonrespondents is imputation. Compared with the likelihood-based method, imputation is relatively simple and easy to apply. The idea is to treat the imputed values as observed values and then apply standard statistical software to obtain consistent estimators. However, the variability of an estimator obtained by imputation usually differs from that of the estimator obtained from the complete data. Consequently, variance formulas designed for complete data cannot be used to estimate the variance of the estimator produced from the imputed data. Two alternative methods are considered for estimating its variability. One is based on Taylor's expansion and is referred to as the "linearization method." The merit of the linearization method is that it requires less computation. The drawback is that its formula can be very complicated and/or intractable. The other approach is based on resampling (e.g., the bootstrap and the jackknife). The drawback of the resampling method is that it requires intensive computation. The merit is that it is very easy to apply. With the help of fast computers, the resampling method has become much more attractive in practice.

Imputation is popular not only in clinical research but also in many other statistical fields, such as sample surveys. However, the imputation methods in clinical research are more diversified because the study designs are more complex than those of sample surveys. As a result, the statistical properties of many commonly used imputation methods in clinical research are still unknown, while most imputation methods used in sample surveys are well studied. Hence, the imputation methods in clinical research provide a unique challenge and also an opportunity for statisticians in the area of clinical research. In what follows, we summarize the most commonly used imputation methods and investigate their statistical properties. Recent developments are also discussed.
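The point about variance estimation under imputation can be illustrated with a minimal Python sketch. It is not taken from the article: the data are simulated, the missingness is assumed to be completely at random, mean imputation stands in for whatever imputation method a study would use, and all function names are hypothetical. The sketch compares the complete-data variance formula applied to the imputed data with a bootstrap estimate that re-imputes each resample.

```python
import random
import statistics

def mean_impute(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

def naive_variance_of_mean(completed):
    """Complete-data formula s^2 / n, applied as if imputed values were observed."""
    return statistics.variance(completed) / len(completed)

def bootstrap_variance_of_mean(values, n_boot=2000, seed=0):
    """Resample the incomplete data with replacement, re-impute each resample,
    and return the sample variance of the resulting mean estimates."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    while len(estimates) < n_boot:
        resample = [values[rng.randrange(n)] for _ in range(n)]
        if sum(v is not None for v in resample) < 2:
            continue  # skip degenerate resamples with too few observed values
        estimates.append(statistics.fmean(mean_impute(resample)))
    return statistics.variance(estimates)

# Simulated (hypothetical) responses with about 30% missing completely at random.
rng = random.Random(42)
full = [rng.gauss(10.0, 2.0) for _ in range(200)]
incomplete = [None if rng.random() < 0.3 else y for y in full]

naive = naive_variance_of_mean(mean_impute(incomplete))
boot = bootstrap_variance_of_mean(incomplete)
print(f"naive complete-data formula: {naive:.5f}")
print(f"bootstrap with re-imputation: {boot:.5f}")
```

In this setup the naive figure should come out smaller than the bootstrap figure, because mean imputation shrinks the sample variance and the formula divides by the full sample size rather than by the number of observed responses. Re-imputing inside each bootstrap resample is the design choice that lets the resampling estimate reflect the uncertainty added by imputation.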

The 'Why' behind including 'Y' in your imputation model

Missing Data Imputation: Focusing on Single Imputation.

Imputation for prediction: beware of diminishing returns

Multiple Imputation When Variables Exceed Observations: An Overview of Challenges and Solutions

Does imputation matter? Benchmark for predictive models

Regression with missing Ys: An improved strategy for analyzing multiply imputed data

To Impute or not to Impute? Missing Data in Treatment Effect Estimation

On Missing Data Imputation for IRB Models

Variable selection with missing data in both covariates and outcomes: Imputation and machine learning

No imputation without representation

The performance of prognostic models depended on the choice of missing value imputation algorithm: a simulation study

Imputation and missing indicators for handling missing data in the development and deployment of clinical prediction models: A simulation study

On the Relation between Prediction and Imputation Accuracy under Missing Covariates

Multiple imputation for missing data: fully conditional specification versus multivariate normal imputation

Explainability of Machine Learning Models under Missing Data

Random features models: a way to study the success of naive imputation

Imputation Matters: A Deeper Look into an Overlooked Step in Longitudinal Health and Behavior Sensing Research

Imputation and Missing Indicators for handling missing data in the development and implementation of clinical prediction models: a simulation study

Auxiliary variables in multiple imputation in regression with missing X: a warning against including too many in small sample research

Imputation in Clinical Research

How handling missing data may impact conclusions: A comparison of six different imputation methods for categorical questionnaire data