In-Database Data Imputation

Massimo Perini, Milos Nikolic
DOI: https://doi.org/10.1145/3639326
2024-01-07
Abstract: Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making. Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates (e.g., mean), are computationally efficient but may introduce bias and disrupt variable relationships, leading to inaccurate analyses. Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time, limiting their applicability to small datasets.
Databases, Machine Learning
What problem does this paper attempt to address?
This paper addresses efficient, high-quality handling of missing data inside database systems. Traditional methods, such as deleting incomplete records or filling in the column mean, are computationally cheap but may introduce bias and disrupt relationships between variables, leading to inaccurate analyses. Model-based imputation methods, such as Multiple Imputation by Chained Equations (MICE), better preserve the variability and relationships in the data, but their high computational cost limits them to small datasets.
The paper proposes a MICE variant that uses computation sharing and a ring abstraction to accelerate model training, so that imputation runs efficiently within the database system. To impute continuous and categorical values, the authors develop techniques for learning stochastic linear regression and Gaussian discriminant analysis models inside the database. Their implementations in PostgreSQL and DuckDB are up to two orders of magnitude faster than other MICE implementations and model-based imputation techniques, while maintaining high imputation quality.
The challenges addressed in the paper are: 1) how to implement model-based imputation in a database management system, 2) how to reduce long imputation times, and 3) how to avoid the data blow-up caused by preprocessing steps such as joins and one-hot encoding. To address them, the optimized MICE algorithm relies on the performance and scalability of the database system for model training and imputation, and reduces data redundancy through ring-based and factorization optimizations. Experimental results show that the proposed in-database method outperforms existing methods in computation time and imputation quality, particularly for datasets with a low proportion of missing values. The code is open source and can be used with PostgreSQL and DuckDB.
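To make the chained-equations idea concrete, the following is a minimal sketch of MICE-style imputation with a stochastic linear regression model, written in plain Python for two numeric columns. It is an illustration under simplifying assumptions, not the paper's in-database algorithm: it does not use SQL, computation sharing, or the ring abstraction, and the function names `fit_simple_regression` and `mice_two_columns` are hypothetical.

```python
import random
import statistics


def fit_simple_regression(xs, ys):
    """Least-squares fit of ys ~ a + b * xs; returns (a, b, residual_std)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    dof = max(len(xs) - 2, 1)  # guard against division by zero
    sigma = (sum(r * r for r in residuals) / dof) ** 0.5
    return a, b, sigma


def mice_two_columns(rows, n_iters=10, seed=0):
    """Impute missing values (None) in a two-column table.

    rows: list of [u, v] pairs; the input is not modified.
    """
    rng = random.Random(seed)
    data = [list(r) for r in rows]
    missing = [[value is None for value in r] for r in rows]

    # Step 1: initialise every missing cell with the observed column mean.
    for col in (0, 1):
        mean = statistics.fmean(r[col] for r in rows if r[col] is not None)
        for r in data:
            if r[col] is None:
                r[col] = mean

    # Step 2: repeatedly re-impute each column from the other one.
    for _ in range(n_iters):
        for target in (0, 1):
            pred = 1 - target
            # Train only on rows where the target was actually observed.
            train = [(r[pred], r[target])
                     for r, m in zip(data, missing) if not m[target]]
            a, b, sigma = fit_simple_regression(
                [p for p, _ in train], [t for _, t in train])
            # Stochastic regression: prediction plus Gaussian noise.
            for r, m in zip(data, missing):
                if m[target]:
                    r[target] = a + b * r[pred] + rng.gauss(0.0, sigma)
    return data
```

Adding noise drawn from the residual distribution, rather than using the bare prediction, is what keeps the imputed values from understating the variability of the column. Each pass of this loop retrains one model per incomplete column, which gives a sense of why the summary highlights sharing computation across models as the main lever for speed.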