Modern Multiple Imputation with Functional Data

Aniruddha Rajendra Rao, Matthew Reimherr
DOI: https://doi.org/10.1002/sta4.331
2020-11-25
Abstract: This work considers the problem of fitting functional models with sparsely and irregularly sampled functional data. It overcomes the limitations of the state-of-the-art methods, which face major challenges in the fitting of more complex non-linear models. Currently, many of these models cannot be consistently estimated unless the number of observed points per curve grows sufficiently quickly with the sample size, whereas we show numerically that a modified approach with more modern multiple imputation methods can produce better estimates in general. We also propose a new imputation approach that combines the ideas of MissForest with Local Linear Forest and compare their performance with PACE and several other multivariate multiple imputation methods. This work is motivated by a longitudinal study on smoking cessation, in which the Electronic Health Records (EHR) from Penn State PaTH to Health allow for the collection of a great deal of data, with highly variable sampling. To illustrate our approach, we explore the relation between relapse and diastolic blood pressure. We also consider a variety of simulation schemes with varying levels of sparsity to validate our methods.
Methodology, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of fitting complex nonlinear models to sparsely and irregularly sampled functional data. Existing methods struggle here because many such models cannot be consistently estimated unless the number of observed points per curve grows sufficiently quickly with the sample size. The paper proposes a new imputation approach that combines the ideas of MissForest (a random-forest-based imputation method) and local linear forests (LLF), and compares it with PACE (Principal Analysis by Conditional Expectation), MissForest, MICE (Multivariate Imputation by Chained Equations), and other multivariate multiple imputation methods.
The paper points out that single imputation methods, such as mean imputation or PACE, are useful but do not account for the uncertainty introduced by imputation, which may lead to misleading uncertainty measures and potential biases. To address this, the paper considers multiple imputation, which fills in the missing values several times to create multiple "complete" datasets and thereby reflects the uncertainty of the imputation process. The authors implemented these methods using the mice and missForest packages in R and incorporated local linear forests into the new imputation method to exploit the smoothness of functional data.
Using a longitudinal study on smoking cessation as a motivating example, the paper explores the relationship between diastolic blood pressure and relapse, and validates the methods through simulation scenarios with varying levels of sparsity. The study also highlights the limitations of existing Functional Data Analysis (FDA) methods for complex models: for sparse functional data, current methods may not be applicable to estimating nonlinear models.
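
The multiple-imputation idea described above yields several completed datasets, each giving its own estimate of a quantity of interest. The summary does not say how those estimates are combined. A common choice, assumed here and not confirmed by the source, is Rubin's rules. The sketch below is a minimal illustration of that pooling step; the function name `pool_rubin` and the numbers are hypothetical.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool per-imputation results with Rubin's rules (assumed pooling rule).

    estimates: point estimate from each of the m completed datasets
    variances: squared standard error of each of those estimates
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need matching lists from at least two imputations")
    q_bar = statistics.fmean(estimates)      # pooled point estimate
    u_bar = statistics.fmean(variances)      # within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b          # total variance
    return q_bar, total

# Hypothetical results from m = 3 imputed datasets.
pooled, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what a single imputation cannot supply: it carries the disagreement among the completed datasets into the final variance.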
The paper also proposes refined imputation strategies, including binning and cautious initialization, to improve estimation for both linear and nonlinear models. In short, the paper aims to handle sparse and irregularly sampled data effectively in Functional Data Analysis, improving estimation accuracy for complex nonlinear models while reducing the impact of imputation uncertainty.
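
The binning strategy mentioned above can be pictured as mapping each curve's irregular observation times onto a shared grid, so that sparse functional data become an ordinary matrix with missing entries that multivariate imputation methods can process. The sketch below is one plausible reading, not the authors' implementation: the equal-width bins, the within-bin averaging, the function name `bin_curve`, and the data are all assumptions made for illustration.

```python
import math

def bin_curve(times, values, n_bins, t_min=0.0, t_max=1.0):
    """Average one curve's irregular observations within equal-width bins.

    Returns a list of length n_bins. Bins with no observation hold NaN,
    which an imputation method can then treat as missing.
    """
    width = (t_max - t_min) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, y in zip(times, values):
        # Clamp the index so that t == t_max falls into the last bin.
        idx = min(int((t - t_min) / width), n_bins - 1)
        sums[idx] += y
        counts[idx] += 1
    return [s / c if c else math.nan for s, c in zip(sums, counts)]

# Hypothetical data: two subjects observed at different, irregular times.
curves = [
    ([0.05, 0.30, 0.95], [80.0, 82.0, 79.0]),
    ([0.10, 0.15, 0.60], [70.0, 74.0, 75.0]),
]
grid = [bin_curve(t, y, n_bins=4) for t, y in curves]
```

Each row of `grid` is one subject and each column one time bin. Coarser bins leave fewer missing cells but blur the curve, so the number of bins trades sparsity against resolution.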