Semi-Supervised Robust Hidden Markov Regression for Large-Scale Time-Series Industrial Data Analytics and Its Applications to Soft Sensing
Weiming Shao, Wenxue Han, Chuanfa Xiao, Lei Chen, Meng-Qin Yu, Junghui Chen
DOI: https://doi.org/10.1109/tase.2024.3417019
IF: 6.636
2024-01-01
IEEE Transactions on Automation Science and Engineering
Abstract: Hidden Markov models (HMMs) for time-series data analysis are attracting wide interest in industry due to their ability to model pervasive dynamics and non-Gaussianities. In this paper, with a focus on industrial soft sensor applications, a semi-supervised robust hidden Markov regression (SsRHMR) model is first proposed to improve the performance of HMMs in two challenging industrial scenarios, namely scarce labeled samples and outlying data, either of which may prevent HMMs from learning well-suited parameters. Furthermore, a distributed learning algorithm for the SsRHMR (termed D-SsRHMR) is developed to overcome two limitations of HMMs in modeling large-scale time-series data: high computational complexity and the inability to handle long-period missing values. Both the SsRHMR and the D-SsRHMR are evaluated on a synthetic case and an actual process, and the results demonstrate the effectiveness and feasibility of the proposed models and learning algorithms in improving prediction accuracy and accelerating training. Note to Practitioners: Before applying the SsRHMR to industrial soft sensing, we advise first selecting features based on process mechanisms and expert knowledge, that is, carefully selecting the secondary variables so as to reduce the dimensionality of the input space. In general, the lower the dimensionality of the secondary variables, the more accurate their estimated distributions and the more efficient the training of the SsRHMR. In addition, the D-SsRHMR benefits from equal-sized subsets, since the efficiency of the distributed learning algorithm is limited by the most computationally demanding slave computer, such as the one processing the largest number of samples. In practice, it is therefore preferable to partition the entire time-series dataset into subsets of as nearly equal size as possible.
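The equal-size partitioning advice can be illustrated with a minimal sketch. The abstract does not specify the paper's actual partitioning scheme, so the function name `partition_equal` and the choice of contiguous index ranges (to preserve temporal order within each subset) are assumptions made here for illustration only.

```python
def partition_equal(n_samples, n_workers):
    """Split range(n_samples) into contiguous (start, stop) index ranges.

    Hypothetical helper, not taken from the paper: subset sizes differ by
    at most one sample, so no worker is given a noticeably larger share.
    """
    if n_workers <= 0 or n_workers > n_samples:
        raise ValueError("n_workers must be between 1 and n_samples")

    # Every worker gets `base` samples; the first `extra` workers get one more.
    base, extra = divmod(n_samples, n_workers)

    bounds = []
    start = 0
    for k in range(n_workers):
        size = base + (1 if k < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


if __name__ == "__main__":
    # Ten time steps shared among three workers.
    for start, stop in partition_equal(10, 3):
        print(start, stop, stop - start)
```

Each `(start, stop)` pair can be used to slice the time-ordered dataset for one worker. Because the largest subset exceeds the smallest by at most one sample, the slowest worker processes close to the average load, assuming per-sample cost is roughly uniform.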