Abstract: Handling missing values is a critical step in preparing and analyzing data. Imputation, the process of filling in those values, makes the dataset complete and reduces the risk of bias and error in subsequent analysis. The key contribution of this work is an assessment of the efficiency of imputation by feature importance (IBFI) with several base learning algorithms. The study investigates individual and ensemble machine learning methods as base learners, namely support vector machines with a linear kernel (SVML), boosted linear regression (BLR), deep boosting (DBP), and K-Nearest Neighbor (K-NN), for predicting missing values. Each missingness pattern, missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR), is explicitly introduced into the dataset at different percentages (15%, 45%, 25%). Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) are among the commonly used performance metrics employed to gauge the effectiveness of the IBFI framework. The dataset comprises soil radon and thoron gas concentration time series along with meteorological parameters and spans a 14-month period, during which four earthquake events were recorded. DBP consistently outperforms the other base learning models in imputing missing values across the variables within the IBFI framework. Specifically, DBP achieves an average RMSE of 573.165 for the Radon variable under MCAR scenarios. For the Thoron variable, DBP achieves average MAPE values of 0.7405, 0.7249, and 0.8212 under MCAR, MNAR, and MAR conditions, respectively. DBP also yields competitive results for imputing missing entries in the Temperature, Relative Humidity, and Pressure variables.
These findings highlight the effectiveness of DBP in accurately predicting missing values. The study concludes that IBFI with the deep boosting model imputes more accurately than IBFI with the other base learning models. It therefore recommends DBP as the base learning algorithm in the imputation by feature importance framework for uncovering hidden patterns in time series data such as soil radon gas. Replicating the study on heterogeneous datasets would improve understanding of how well imputation by feature importance generalizes and how broadly it applies.
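
The evaluation idea described in the abstract (hide known values, impute them, then score the imputations with RMSE and MAPE) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy sine series, the helper names `mask_mcar` and `mean_impute`, and the use of mean imputation as a stand-in base learner. It is not the IBFI framework or any of the paper's base learners, and MAPE is expressed as a percentage, which may differ from the convention behind the values reported above.

```python
import math
import random


def mask_mcar(values, fraction, seed=0):
    """Hide a random fraction of entries; under MCAR every position is equally likely."""
    rng = random.Random(seed)
    n_missing = round(len(values) * fraction)
    missing_idx = sorted(rng.sample(range(len(values)), n_missing))
    masked = list(values)
    for i in missing_idx:
        masked[i] = None
    return masked, missing_idx


def mean_impute(masked):
    """Stand-in imputer (assumption): fill every gap with the mean of the observed values."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def rmse(actual, predicted):
    """Root Mean Squared Error over paired values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent; actual values must be non-zero."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical toy series standing in for a sensor variable.
series = [100.0 + 10.0 * math.sin(i / 5.0) for i in range(200)]

# Hide 15% of the values, impute them, and score only the hidden positions.
masked, idx = mask_mcar(series, 0.15, seed=42)
imputed = mean_impute(masked)
truth = [series[i] for i in idx]
guess = [imputed[i] for i in idx]

print(f"RMSE: {rmse(truth, guess):.3f}")
print(f"MAPE: {mape(truth, guess):.3f}%")
```

Scoring only the masked positions matters: the observed entries are copied through unchanged, so including them would make any imputer look better than it is. MAR and MNAR masking would replace the uniform sampling in `mask_mcar` with a rule that depends on other variables or on the hidden value itself.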

Time Series Reconstruction With Feature-Driven Imputation: A Comparison of Base Learning Algorithms
