Predictive Maintenance Study for High-Pressure Industrial Compressors: Hybrid Clustering Models

Alessandro Costa, Emilio Mastriani, Federico Incardona, Kevin Munari, Sebastiano Spinello
2024-11-21
Abstract: This study introduces a predictive maintenance strategy for high-pressure industrial compressors using sensor data and features derived from unsupervised clustering integrated into classification models. The goal is to enhance model accuracy and efficiency in detecting compressor failures. After data preprocessing, sensitive clustering parameters were tuned to identify algorithms that best capture the dataset's temporal and operational characteristics. Clustering algorithms were evaluated using quality metrics like Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI), selecting those most effective at distinguishing between normal and non-normal conditions. These features enriched regression models, improving failure detection accuracy by 4.87 percent on average. Although training time was reduced by 22.96 percent, the decrease was not statistically significant, varying across algorithms. Cross-validation and key performance metrics confirmed the benefits of clustering-based features in predictive maintenance models.
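The central idea of the abstract, attaching an unsupervised cluster label to each sensor reading as an extra input feature, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `readings`, the helper names `kmeans` and `add_cluster_feature`, and the use of a plain k-means with a deterministic initialisation. The paper's actual algorithms, parameters, and data are not reproduced, and the downstream classifier is omitted.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with a deterministic farthest-point start (illustrative only)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest already-chosen centroid.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def add_cluster_feature(points, labels):
    """Append the cluster label to each row as one extra numeric feature."""
    return [tuple(p) + (float(lab),) for p, lab in zip(points, labels)]

# Hypothetical standardized sensor readings: two well-separated operating regimes.
readings = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (5.0, 5.2), (5.1, 4.9), (4.9, 5.0)]

labels = kmeans(readings, k=2)
augmented = add_cluster_feature(readings, labels)
```

Each row of `augmented` carries the original features plus the cluster label, so it could be passed to any supervised model in place of the raw readings.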
Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **how to use sensor data and clustering-based features to improve the accuracy and efficiency of predictive maintenance models for high-pressure industrial compressors, so as to better detect compressor failures**. Specifically, the authors propose a hybrid clustering model approach with the following steps:

1. **Data preprocessing**:
   - Clean the data and remove rows with missing or invalid values.
   - Compute the autocorrelation matrix to identify highly correlated features and remove redundant ones.
   - Use analysis of variance (ANOVA) to further evaluate the discriminatory power of the remaining features and retain those that are statistically significant (p-value < 0.05).
   - Standardize the data and create the target variable "NORMAL", defining the time period of event occurrence based on specific timestamps.
2. **Determine the optimal clustering parameters**:
   - For density-based clustering algorithms (such as HDBSCAN), determine the best epsilon value: compute the distances between neighboring points, plot the resulting curve, and take the point of maximum curvature as the epsilon value.
   - For algorithms such as K-Means, determine the optimal number of clusters (k), using metrics such as the silhouette score to select the best value of k.
3. **Apply and evaluate clustering algorithms**:
   - Run multiple clustering algorithms (such as K-Means, HDBSCAN, OPTICS, BIRCH, GMM, and MS-AMS) with the optimized parameters.
   - Use quality measures such as the adjusted Rand index (ARI) and normalized mutual information (NMI) to evaluate clustering quality and select the algorithm that best distinguishes between normal and abnormal states.
4. **Combine with classification models**:
   - Add the clustering results as additional features to the classification model to improve the accuracy of fault detection.
   - Use cross-validation to evaluate model performance and compare training time and accuracy with and without clustering features.

With this approach, the authors improved fault detection accuracy by 4.87% on average. Training time also fell, by 22.96%, although that decrease was not statistically significant and varied across algorithms. This indicates that clustering features can help manage and predict the operating status of high-pressure industrial compressors more effectively and reduce the risk of failure.

### Key formulas

- **Autocorrelation matrix**:
  \[ \text{Correlation Matrix} = \text{corr}(X) \]
  where \(X\) is the feature matrix.
- **F-value and p-value**:
  \[ F = \frac{\text{Between-group variance}}{\text{Within-group variance}} \]
  \[ p\text{-value} = P(F > F_{\text{observed}}) \]
- **Silhouette coefficient (silhouette score)**:
  \[ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \]
  where \(a(i)\) is the average distance from sample \(i\) to the other samples in the same cluster, and \(b(i)\) is the average distance from sample \(i\) to the samples in the nearest different cluster.
- **Adjusted Rand index (ARI)**:
  \[ ARI = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_i\binom{a_i}{2} + \sum_j\binom{b_j}{2}\right] - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right] / \binom{n}{2}} \]
  where \(n_{ij}\) is the number of samples simultaneously belonging to the \(i\)-th class