Abstract: Traditional fault diagnosis methods using Convolutional Neural Networks (CNNs) often struggle with capturing the temporal dynamics of vibration signals. To overcome this, the application of Transformer-based Vision Transformer (ViT) methods to fault diagnosis is gaining traction. Nonetheless, these methods typically require extensive preprocessing, which increases computational complexity, potentially reducing the efficiency of the diagnosis process. Addressing this gap, this paper presents the Time Series Vision Transformer (TSViT), tailored for effective fault diagnosis. TSViT incorporates a convolutional layer to extract local features from vibration signals, alongside a transformer encoder to discern long-term temporal patterns. A thorough experimental comparison on three diverse datasets demonstrates TSViT's effectiveness and adaptability. Moreover, the paper delves into the influence of hyperparameter tuning on the model's performance, computational demand, and parameter count. Remarkably, TSViT achieves an unprecedented 100% average accuracy on two test sets and 99.99% on another, showcasing its exceptional diagnostic capabilities.
What problem does this paper attempt to address?
The paper addresses the difficulty that traditional fault diagnosis methods have in capturing the temporal dynamics of vibration signals from rotating machinery. Convolutional neural networks (CNNs) extract local features well, but their convolutional filters limit their ability to capture global information, so they cannot effectively model long-range dependencies in a time series. Vision Transformer (ViT) methods model such long-range dependencies better, but they usually require extensive preprocessing, which increases computational complexity and reduces the efficiency of the diagnosis process.
To address these problems, the paper proposes the Time Series Vision Transformer (TSViT) to improve the effectiveness and adaptability of fault diagnosis. TSViT combines convolutional layers, which extract local features, with Transformer encoders, which identify long-term temporal patterns, to achieve comprehensive spatial and temporal feature extraction. Experimental results show that TSViT performs well on three different datasets, with an average accuracy of 100% on two test sets and 99.99% on the third.
### Main contributions:
1. **Propose the TSViT model**: A Time Series Vision Transformer model for fault diagnosis that processes raw time-series signals directly.
2. **Develop the time-series patch embedding method**: Enables TSViT to accept one-dimensional or multi-dimensional time-domain signals as input, not just image data.
3. **Design experiments**: Experiments on three different datasets show that TSViT achieves high-accuracy fault diagnosis without any preprocessing.
### Method overview:
- **Embedding layer**: Includes time-series patch embedding, class token, and position embedding.
- **Transformer encoder layer**: Consists of Multi-head Self-Attention, a Multi-Layer Perceptron (MLP), Residual Connections, and Layer Normalization.
- **Classification layer**: Maps the features extracted by the Transformer encoder to a prediction over the one-hot encoded fault classes.
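
The embedding layer described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes non-overlapping patches and a linear projection per patch (equivalent to a 1-D convolution whose kernel size and stride both equal the patch size). The function name `patch_embed`, the sizes, and the random initial values are all hypothetical.

```python
import random

def patch_embed(signal, patch_size, weights, bias, cls_token, pos_embed):
    """Turn a 1-D signal into a token sequence: class token + projected patches."""
    n_patches = len(signal) // patch_size      # trailing samples are dropped
    tokens = [list(cls_token)]                 # class token goes first
    for p in range(n_patches):
        patch = signal[p * patch_size:(p + 1) * patch_size]
        # Linear projection of one patch to a dim-sized vector
        # (assumed here; same as a conv with kernel = stride = patch_size).
        tokens.append([
            sum(w * x for w, x in zip(weights[d], patch)) + bias[d]
            for d in range(len(weights))
        ])
    # Add the position embedding element-wise to every token.
    return [[t + e for t, e in zip(tok, pos)] for tok, pos in zip(tokens, pos_embed)]

# Hypothetical sizes chosen only for illustration.
rng = random.Random(0)
signal_len, patch_size, dim = 1024, 64, 8
n_tokens = signal_len // patch_size + 1        # 16 patches + 1 class token

signal = [rng.gauss(0.0, 1.0) for _ in range(signal_len)]   # stand-in for raw vibration data
weights = [[rng.gauss(0.0, 0.02) for _ in range(patch_size)] for _ in range(dim)]
bias = [0.0] * dim
cls_token = [0.0] * dim
pos_embed = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_tokens)]

tokens = patch_embed(signal, patch_size, weights, bias, cls_token, pos_embed)
print(len(tokens), len(tokens[0]))  # 17 8
```

The resulting token sequence is what a Transformer encoder would consume; the classification layer would then read the output at the class-token position.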
### Experimental results:
- **PBR dataset**: Training and test loss gradually stabilize and approach 0; training and test accuracy gradually stabilize and approach 100%.
- **CWRU dataset**: In 10 trials, the maximum accuracy (MaxAcc) is 100%, the minimum accuracy (MinAcc) is 99.96%, and the average accuracy (AvgAcc) is 99.99%.
- **XJTU dataset**: In 10 trials, the maximum accuracy, the minimum accuracy, and the average accuracy are all 100%.
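
The MaxAcc, MinAcc, and AvgAcc figures above are summary statistics over repeated trials. A minimal sketch of that computation is shown below; the helper name `summarize_trials` is hypothetical, and the accuracy values are placeholders rather than results from the paper.

```python
def summarize_trials(accuracies):
    """Return max, min, and mean accuracy over a list of per-trial accuracies."""
    if not accuracies:
        raise ValueError("need at least one trial")
    return {
        "MaxAcc": max(accuracies),
        "MinAcc": min(accuracies),
        "AvgAcc": sum(accuracies) / len(accuracies),
    }

# Placeholder values for illustration only (NOT the paper's results).
trial_accuracies = [0.97, 0.99, 1.0, 0.98]
summary = summarize_trials(trial_accuracies)
print(summary)
```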
### Performance in a noisy environment:
- In real industrial settings, collected vibration signals usually contain varying levels of noise. The results show that TSViT still performs well in noisy conditions, and that the influence of noise is smaller when the dataset is larger.
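
One common way to run this kind of noise test is to add white Gaussian noise to a clean signal at a chosen signal-to-noise ratio (SNR). The summary does not state which noise model or SNR levels the paper used, so the sketch below is an assumption; the function names, the test sine wave, and the 0 dB setting are illustrative only.

```python
import math
import random

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR in dB."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))   # SNR = 10*log10(Ps/Pn)
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]

def measured_snr_db(clean, noisy):
    """Estimate the SNR actually obtained, from the clean and noisy signals."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)

rng = random.Random(0)
# Synthetic stand-in for a vibration signal (arbitrary frequency and sampling rate).
clean = [math.sin(2 * math.pi * 50 * t / 12000) for t in range(20000)]
noisy = add_noise(clean, snr_db=0.0, rng=rng)
achieved = measured_snr_db(clean, noisy)
print(round(achieved, 2))
```

Because the noise is random, the achieved SNR only approximates the target; the gap shrinks as the signal gets longer.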
### Hyperparameter analysis:
- The paper also explores the influence of different hyperparameter values on model performance, computational requirements, and the number of parameters, further verifying the robustness and effectiveness of TSViT.
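
To make the link between hyperparameters and parameter count concrete, the sketch below counts learnable parameters for a generic ViT-style model: a patch projection, a class token, a learned position embedding, standard encoder blocks, and a linear head. It is an assumption that TSViT follows this layout exactly. The function name and argument names are hypothetical, and the numbers it produces are not the paper's.

```python
def vit_param_count(patch_size, in_channels, n_patches, dim, depth, mlp_dim, n_classes):
    """Count learnable parameters of a generic ViT-style model (assumed layout)."""
    embed = patch_size * in_channels * dim + dim     # patch projection weights + bias
    embed += dim                                     # class token
    embed += (n_patches + 1) * dim                   # learned position embedding
    attn = 4 * (dim * dim + dim)                     # Q, K, V, and output projections
    mlp = dim * mlp_dim + mlp_dim + mlp_dim * dim + dim   # two linear layers with biases
    norms = 2 * 2 * dim                              # two LayerNorms (scale + shift) per block
    encoder = depth * (attn + mlp + norms)
    head = 2 * dim + dim * n_classes + n_classes     # final LayerNorm + linear classifier
    return embed + encoder + head

# Illustrative settings only: compare two depths with everything else fixed.
base = dict(patch_size=64, in_channels=1, n_patches=16, dim=64, mlp_dim=128, n_classes=10)
for depth in (2, 4):
    print(depth, vit_param_count(depth=depth, **base))
```

In this layout the count grows linearly with depth and roughly quadratically with the embedding dimension, while the number of attention heads does not change it.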
In conclusion, TSViT addresses the limitations of traditional fault diagnosis methods on vibration signals from rotating machinery by combining convolutional layers with Transformer encoders, improving both the accuracy and the efficiency of fault diagnosis.