Abstract: Missing values, or incomplete data, are commonly encountered in clinical research and have been studied by many authors. The causes of missing values in a study can be classified into two categories. The first category includes reasons that are not directly related to the study. For example, a patient may be lost to follow-up because he/she moves out of the area. Missing values in this category can be considered missing completely at random. The second category includes reasons that are related to the study. For example, a patient may withdraw from the study due to treatment-emergent adverse events. In practice, it is not uncommon to have multiple assessments from each subject. Subjects with all observations missing are called unit nonrespondents. Because unit nonrespondents do not provide any useful information, these subjects are usually excluded from the analysis. On the other hand, subjects with some, but not all, observations missing are referred to as item nonrespondents. Excluding item nonrespondents from the analysis is considered to be against the intent-to-treat (ITT) principle and, hence, not acceptable. In clinical research, the primary analysis is usually conducted on the ITT population, which includes all randomized subjects with at least one posttreatment evaluation. As a result, most item nonrespondents may be included in the ITT population. Moreover, excluding item nonrespondents may seriously decrease the power/efficiency of the study. To account for item nonrespondents, two methods are commonly considered. The first is the so-called likelihood-based method. Under a parametric model, the marginal likelihood function for the observed responses is obtained by integrating out the missing responses. The parameter of interest can then be estimated by the maximum likelihood estimator (MLE), and a corresponding test (e.g., the likelihood ratio test) can be constructed. 
The merit of this method is that the resulting statistical procedures are usually efficient. The drawback is that the calculation of the marginal likelihood can be difficult. As a result, special statistical or numerical algorithms are commonly applied to obtain the MLE. For example, the expectation–maximization (EM) algorithm is one of the most popular methods for obtaining the MLE when there are missing data. The other method for item nonrespondents is imputation. Compared with the likelihood-based method, imputation is relatively simple and easy to apply. The idea of imputation is to treat the imputed values as observed values and then apply standard statistical software to obtain consistent estimators. However, it should be noted that the variability of the estimator obtained by imputation is usually different from that of the estimator obtained from the complete data. In this case, the formulas designed to estimate the variance from a complete data set cannot be used to estimate the variance of the estimator produced from the imputed data. As an alternative, two methods are considered for estimating its variability. One is based on Taylor’s expansion and is referred to as the “linearization method.” The merit of the linearization method is that it requires less computation. However, the drawback is that its formula can be very complicated and/or intractable. The other approach is based on resampling methods (e.g., the bootstrap and the jackknife). The drawback of the resampling method is that it requires intensive computation. The merit is that it is very easy to apply. With the help of fast computers, the resampling method has become much more attractive in practice. Note that imputation is popular not only in clinical research but also in many other statistical fields, such as sample surveys. 
However, the imputation methods used in clinical research are more diverse because the study designs are more complex than those of sample surveys. As a result, the statistical properties of many commonly used imputation methods in clinical research are still unknown, whereas most imputation methods used in sample surveys are well studied. Hence, imputation methods in clinical research present a unique challenge, and also an opportunity, for statisticians working in this area. In what follows, we summarize the most commonly used imputation methods and investigate their statistical properties. Recent developments are also discussed.
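
The point about variance estimation after imputation can be illustrated with a small simulation. The sketch below is not taken from the text: it assumes simple mean imputation, data missing completely at random, and one common bootstrap variant in which the incomplete data are resampled and each replicate is re-imputed. The function names, the simulated sample, and the 30% missingness rate are all hypothetical choices made for illustration.

```python
import random
import statistics


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def naive_variance_of_mean(completed):
    """Complete-data formula s^2 / n, applied as if imputed values were observed."""
    return statistics.variance(completed) / len(completed)


def bootstrap_variance_of_mean(values, n_boot=2000, seed=0):
    """Resample the incomplete data, re-impute each replicate, and return
    the variance of the resulting estimates of the mean."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(n_boot):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        if all(v is None for v in sample):
            continue  # nothing observed to impute from; skip this replicate
        estimates.append(statistics.fmean(mean_impute(sample)))
    return statistics.variance(estimates)


# Hypothetical data: 200 responses, roughly 30% missing completely at random.
rng = random.Random(42)
data = [rng.gauss(10.0, 2.0) for _ in range(200)]
data = [None if rng.random() < 0.3 else v for v in data]

completed = mean_impute(data)
naive_var = naive_variance_of_mean(completed)
boot_var = bootstrap_variance_of_mean(data)

print(f"naive variance of the mean:     {naive_var:.5f}")
print(f"bootstrap variance of the mean: {boot_var:.5f}")
```

Under these assumptions the complete-data formula tends to understate the variability, because the imputed values add no spread of their own, while the resampling estimate accounts for the imputation step at the cost of repeating it for every replicate.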
