Abstract: Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making. Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates (e.g., mean), are computationally efficient but may introduce bias and disrupt variable relationships, leading to inaccurate analyses. Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time, limiting their applicability to small datasets.

What problem does this paper attempt to address?

This paper addresses efficient, high-quality handling of missing data inside database systems. Traditional methods, such as deleting incomplete records or filling in mean values, are computationally cheap but may introduce bias and distort relationships between variables, leading to inaccurate analyses. Model-based imputation methods, such as Multiple Imputation by Chained Equations (MICE), better preserve the variability and relationships in the data, but their high computational cost limits their use on large datasets.

The paper proposes an optimized MICE method that uses computation sharing and a ring abstraction to accelerate model training and perform imputation efficiently within the database system. To impute continuous and categorical values, the authors develop techniques for in-database learning of stochastic linear regression and Gaussian discriminant analysis models. Their implementations in PostgreSQL and DuckDB are up to two orders of magnitude faster than other MICE implementations and model-based imputation techniques, while maintaining high imputation quality.

The challenges addressed include: 1) how to implement model-based imputation in a database management system, 2) long imputation times, and 3) avoiding data blow-up caused by preprocessing steps such as joins and one-hot encoding. To address these, the optimized MICE algorithm leverages the performance and scalability of the database system to speed up model training and imputation, and reduces data redundancy through ring-based and factorization optimizations. Experimental results show that the proposed in-database imputation method outperforms existing methods in computation time and imputation quality, particularly for datasets with a low proportion of missing values. The code is open source and can be used with PostgreSQL and DuckDB.
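To make the chained-equations idea concrete, here is a minimal NumPy sketch of a MICE-style loop for continuous columns. It is not the paper's in-database implementation and uses none of its optimizations. The function name `mice_impute`, its parameters, and the synthetic data are hypothetical. The regression step is assumed, for illustration only, to be a least-squares prediction perturbed with Gaussian noise scaled by the residual standard deviation.

```python
import numpy as np

def mice_impute(X, n_iters=10, seed=0):
    """Chained-equation imputation for a numeric matrix with NaNs (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: initialise every missing cell with its column mean.
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        filled[missing[:, j], j] = col_means[j]

    # Step 2: repeatedly re-impute each incomplete column from all the others.
    for _ in range(n_iters):
        for j in range(X.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue  # nothing to impute in this column
            obs_j = ~miss_j
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(filled)), others])  # intercept + predictors

            # Least-squares fit on the rows where column j was originally observed.
            beta, *_ = np.linalg.lstsq(A[obs_j], filled[obs_j, j], rcond=None)
            resid = filled[obs_j, j] - A[obs_j] @ beta
            dof = max(int(obs_j.sum()) - A.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)

            # Assumed stochastic step: prediction plus Gaussian noise, so imputed
            # values keep some of the variability a plain mean fill would remove.
            noise = rng.normal(0.0, sigma, size=int(miss_j.sum()))
            filled[miss_j, j] = A[miss_j] @ beta + noise
    return filled

# Usage on synthetic data: the second column is roughly twice the first.
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + data_rng.normal(scale=0.1, size=200)])
data[:20, 1] = np.nan  # hide 10% of the second column
completed = mice_impute(data)
```

Each pass refits one model per incomplete column on the full table. That repeated training is the cost the paper reports reducing by sharing computation across models and iterations inside the database.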

In-Database Data Imputation

Missing Data Imputation: Focusing on Single Imputation.

A web-based approach to data imputation

19 Incomplete Data in Epidemiology and Medical Statistics

Missing Data Imputation by Utilizing Information Within Incomplete Instances

Internal Data Imputation in Data Warehouse Dimensions

Automatic Web-based relational data imputation

On-Line Imputation For Missing Values

An Intelligent Missing Data Imputation Techniques: A Review

Missing Values Imputation Based on Iterative Learning

A Benchmark for Data Imputation Methods

Multiple Imputation for Data-Base Construction

Does imputation matter? Benchmark for predictive models

Introduction to Bayesian Data Imputation

Do we really need imputation in AutoML predictive modeling?

Imputing Missing Data by Fully Conditional Models: Some Cautionary Examples and Guidelines

Relational Data Imputation with Quality Guarantee.

TRIP: an Interactive Retrieving-Inferring Data Imputation Approach

Evaluation of imputation techniques with varying percentage of missing data

Statistical Data, Missing

An Experimental Survey of Missing Data Imputation Algorithms