LSPT-D: Local Similarity Preserved Transport for Direct Industrial Data Imputation
Hao Wang, Xinggao Liu, Zhaoran Liu, Haozhe Li, Yilin Liao, Yuxin Huang, Zhichao Chen
DOI: https://doi.org/10.1109/tase.2024.3506835
IF: 6.636
2024-01-01
IEEE Transactions on Automation Science and Engineering
Abstract: Accurate imputation of missing data is pivotal in real-world industrial applications. Traditional direct imputers, which replace missing elements with basic statistics, offer a practical solution but struggle to adapt to the complex patterns in industrial data, leaving a gap in the research landscape. This study explores the untapped potential of direct imputers, using optimal transport (OT) theory to enhance their adaptability and capacity to handle complex patterns in industrial data, with a focus on preserving local sample-wise similarity as an exemplar. To these ends, we construct a Local Similarity Preserved Transport (LSPT) problem, with a solution algorithm based on the Frank-Wolfe technique to compute the transport cost. Subsequently, we propose the LSPT-D framework, which employs the transport cost of LSPT for distribution matching, directing the gradient flow to the missing data points to update the imputations directly. This strategy maintains local similarity throughout the imputation process, thereby enhancing the overall imputation quality. Our experiments demonstrate that LSPT-D outperforms various baselines in industrial missing data imputation.
Note to Practitioners: Accurate missing data imputation is essential for enhancing the reliability of data analytics and reducing decision-making risks in industrial automation. This study introduces LSPT-D, a non-parametric imputation technique based on OT. Unique to LSPT-D is its ability to preserve local similarity during the imputation process, which makes it particularly advantageous for datasets with varying operational phases and load conditions. In industrial applications, LSPT-D not only significantly improves imputation quality compared to various baseline methods but also maintains modest running costs. Additionally, it serves as an exemplar for developing OT-based imputation strategies that capitalize on the inherent properties of data to improve imputation performance. However, LSPT-D operates under the independent and identically distributed assumption and is thus best applied in scenarios where temporal dependencies, such as trends and seasonality, are minimal or have been previously neutralized.
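The direct-imputation idea summarized in the abstract (match distributions with an OT cost, then push its gradient onto the missing entries only) can be illustrated with a minimal sketch. This is not the authors' LSPT-D: the local-similarity term and the Frank-Wolfe solver are omitted because the abstract does not specify them, and a plain entropic OT plan computed with Sinkhorn iterations is swapped in. The function names `sinkhorn_plan` and `ot_direct_impute`, and all hyperparameters (`batch`, `lr`, `eps_scale`, `n_steps`), are hypothetical choices made for illustration.

```python
import numpy as np


def _logsumexp(Z, axis):
    # Numerically stable log-sum-exp along one axis.
    m = Z.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(Z - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def sinkhorn_plan(C, eps, n_iter=200):
    # Entropic OT plan between two uniform empirical measures (log-domain Sinkhorn).
    n, m = C.shape
    log_a = np.full(n, -np.log(n))
    log_b = np.full(m, -np.log(m))
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        f = eps * (log_a - _logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_b - _logsumexp((f[:, None] - C) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / eps)


def ot_direct_impute(X, n_steps=200, batch=32, lr=0.1, eps_scale=0.1, seed=0):
    # Assumption: NaN marks a missing entry and no column is entirely missing.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    X_hat = np.where(miss, np.nanmean(X, axis=0), X)  # start from column means
    n = X.shape[0]
    batch = min(batch, n // 2)
    for _ in range(n_steps):
        # Draw two disjoint batches of rows from the current imputed data.
        idx = rng.choice(n, size=2 * batch, replace=False)
        ia, ib = idx[:batch], idx[batch:]
        A, B = X_hat[ia], X_hat[ib]
        # Squared Euclidean cost and the transport plan between the batches.
        C = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        P = sinkhorn_plan(C, eps=eps_scale * C.mean() + 1e-12)
        # Gradient of <P, C> with respect to the points, holding the plan fixed.
        grad_A = 2.0 * (P.sum(1, keepdims=True) * A - P @ B)
        grad_B = 2.0 * (P.sum(0)[:, None] * B - P.T @ A)
        # Update only the missing entries; observed values are never modified.
        X_hat[ia] -= lr * batch * grad_A * miss[ia]
        X_hat[ib] -= lr * batch * grad_B * miss[ib]
    return X_hat
```

Calling `ot_direct_impute(X)` on an array with NaN entries returns a filled copy in which observed values are left as they were. Like the method described above, the sketch treats rows as i.i.d. samples, so it inherits the same caveat about temporal dependencies.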