RIFLE: Imputation and Robust Inference from Low Order Marginals

Sina Baharlouei, Kelechi Ogudu, Sze-chuan Suen, Meisam Razaviyayn
2023-09-13
Abstract: The ubiquity of missing values in real-world datasets poses a challenge for statistical inference and can prevent similar datasets from being analyzed in the same study, precluding many existing datasets from being used for new analyses. While an extensive collection of packages and algorithms have been developed for data imputation, the overwhelming majority perform poorly if there are many missing values and low sample sizes, which are unfortunately common characteristics in empirical data. Such low-accuracy estimations adversely affect the performance of downstream statistical models. We develop a statistical inference framework for regression and classification in the presence of missing data without imputation. Our framework, RIFLE (Robust InFerence via Low-order moment Estimations), estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model. We specialize our framework to linear regression and normal discriminant analysis, and we provide convergence and performance guarantees. This framework can also be adapted to impute missing data. In numerical experiments, we compare RIFLE to several state-of-the-art approaches (including MICE, Amelia, MissForest, KNN-imputer, MIDA, and Mean Imputer) for imputation and inference in the presence of missing values. Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small. RIFLE is publicly available at https://github.com/optimization-for-data-driven-science/RIFLE.
Machine Learning, Mathematical Software
What problem does this paper attempt to address?
The paper addresses statistical inference (regression and classification) on datasets with many missing values. When the sample size is small and the proportion of missing values is high, existing imputation methods often produce inaccurate estimates, which in turn degrade the downstream statistical models. The paper proposes RIFLE (Robust InFerence via Low-order moment Estimations), an inference framework that does not require imputing the data first. Instead, it estimates low-order moments of the data distribution (such as the mean and variance) together with confidence intervals, and uses them to learn a distributionally robust model. The framework is specialized to linear regression and normal discriminant analysis, with convergence and performance guarantees.

### Main Contributions of the Paper:

1. **A distributionally robust optimization framework based on low-order moments**: The framework performs statistical inference directly in the presence of missing values, with no prior imputation step. The paper applies it to ridge regression and classification models, giving a new strategy for datasets with many missing values.
2. **Theoretical convergence and iteration complexity analysis**: For the robustified ridge linear regression and normal discriminant analysis models, the paper provides convergence guarantees and analyzes the asymptotic statistical properties of the solutions the algorithms return.
3. **RIFLE as a data imputation tool**: Although RIFLE is primarily designed for direct statistical inference, it can also be used to impute missing data. The paper compares it with several widely used imputation packages (such as MICE, Amelia, and MissForest) on real and synthetic datasets and reports that RIFLE performs better when the proportion of missing values is high.

### Core Ideas of the RIFLE Framework:

- **Estimate low-order moments and their confidence intervals**: Bootstrap techniques are used to estimate the low-order moments of the data distribution (such as the mean and variance) and their confidence intervals from the available data.
- **Construct a distributionally robust optimization problem**: The model parameters are chosen to be optimal for the worst case over the set of distributions whose low-order moments all lie within the estimated confidence intervals.
- **Applicable to various statistical models**: RIFLE is not tied to one specific model class (for example, support vector machines); it can be applied to various statistical models, including linear regression and classification models.

### Application Examples:

- **Linear Regression**: The paper details how to apply RIFLE to ridge linear regression, designs efficient algorithms to solve the resulting problem, and shows how to use these algorithms for inference and imputation.
- **Classification Tasks**: The paper also applies RIFLE to classification, in particular quadratic discriminant analysis. Assuming that the conditional distribution of each class has its own covariance matrix, it proposes a robust quadratic discriminant analysis model.

In summary, RIFLE is a framework for datasets with many missing values that supports both statistical inference and data imputation, and it is reported to outperform the benchmark methods when the proportion of missing values is high or the number of data points is small.
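The moment-estimation step listed under the core ideas can be illustrated with a minimal sketch. It assumes missing entries are encoded as NaN, estimates each second moment E[x_i x_j] from the rows where both features are observed, and uses a percentile bootstrap for the confidence intervals. The function names, the percentile method, and the parameters `n_boot` and `alpha` are illustrative assumptions; they are not taken from the paper or from the RIFLE package, whose estimator may differ in its details.

```python
import numpy as np

def pairwise_second_moments(X):
    """Estimate E[x_i * x_j] for every feature pair from co-observed rows only."""
    observed = ~np.isnan(X)
    X0 = np.where(observed, X, 0.0)          # zero out missing entries
    sums = X0.T @ X0                         # sum of x_i * x_j over co-observed rows
    obs = observed.astype(float)
    counts = obs.T @ obs                     # number of rows where both are observed
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)

def bootstrap_moment_intervals(X, n_boot=200, alpha=0.05, seed=0):
    """Point estimates plus percentile-bootstrap intervals (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    point = pairwise_second_moments(X)
    samples = np.empty((n_boot,) + point.shape)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample rows with replacement
        samples[b] = pairwise_second_moments(X[idx])
    lower = np.nanquantile(samples, alpha / 2, axis=0)
    upper = np.nanquantile(samples, 1 - alpha / 2, axis=0)
    return point, lower, upper

# Usage on synthetic data with roughly 30% of entries removed at random.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.3] = np.nan

point, lower, upper = bootstrap_moment_intervals(X_missing)
print(point.round(2))
print((upper - lower).round(2))              # interval widths per feature pair
```

In a distributionally robust formulation, intervals like `lower` and `upper` would bound the moments over which the worst case is taken. Pairs of features that are rarely observed together yield wider intervals, so the robust model relies on them less.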