Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, Pingyi Fan
Abstract: Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machine anomalous sound detection (ASD) task. This may be caused by the inconsistency of the pre-trained model and the inductive bias of machine audio, resulting in inconsistency in data and architecture. Thus, we propose AnoPatch which utilizes a ViT backbone pre-trained on AudioSet and fine-tunes it on machine audio. It is believed that machine audio is more related to audio datasets than speech datasets, and modeling it from patch level suits the sparsity of machine audio. As a result, AnoPatch showcases state-of-the-art (SOTA) performances on the DCASE 2020 ASD dataset and the DCASE 2023 ASD dataset. We also compare multiple pre-trained models and empirically demonstrate that better consistency yields considerable improvement.
What problem does this paper attempt to address?
This paper addresses the consistency problem that arises when pre-trained models are applied to machine anomalous sound detection (ASD). Although large pre-trained models perform very well in many fields, few works report satisfactory results with them on machine ASD. The authors attribute this to a mismatch between the pre-trained model and machine audio, which shows up as inconsistency in both data and architecture.
### Main problems
1. **Model consistency problem**: Existing pre-trained models perform poorly on machine anomalous sound detection, mainly because the pre-training stage and the fine-tuning stage are inconsistent.
2. **Inconsistency in data and architecture**: Machine audio has different characteristics from speech (for example, it is sparser and more stable), so directly reusing pre-trained models designed for speech is ineffective.
### Solutions
To solve the above problems, the authors propose the AnoPatch model, and its core ideas include:
- **ViT backbone network**: Use a ViT (Vision Transformer) as the backbone to extract patch-level representations from mel-spectrograms.
- **Pre-training and fine-tuning**: Initialize the ViT backbone from BEATs, an audio classification model pre-trained on AudioSet, and fine-tune it on machine audio data.
- **Enhanced fine-tuning tasks**: Strengthen fine-tuning by classifying machine-related metadata (such as machine type, entity ID, speed, etc.), trained with the ArcFace loss to further improve performance.
- **Anomaly detection**: At detection time, all patch-level representations are combined into a single clip-level embedding, which is then scored with the KNN algorithm.
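The detection step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how patch representations are combined or which distance KNN uses, so mean pooling, cosine distance, and `k=1` are assumptions here, and all function names (`split_into_patches`, `mean_pool`, `knn_anomaly_score`) are hypothetical. The toy patch embeddings stand in for what a fine-tuned ViT backbone would output.

```python
import math

def split_into_patches(spec, patch_h, patch_w):
    """Cut a 2-D spectrogram (list of rows) into non-overlapping, flattened patches."""
    n_rows, n_cols = len(spec), len(spec[0])
    patches = []
    for r in range(0, n_rows - patch_h + 1, patch_h):
        for c in range(0, n_cols - patch_w + 1, patch_w):
            patches.append([spec[r + i][c + j]
                            for i in range(patch_h) for j in range(patch_w)])
    return patches

def mean_pool(patch_embeddings):
    """Combine patch-level embeddings into one clip-level embedding (assumed: mean)."""
    n = len(patch_embeddings)
    dim = len(patch_embeddings[0])
    return [sum(p[d] for p in patch_embeddings) / n for d in range(dim)]

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_anomaly_score(query, normal_bank, k=1):
    """Mean distance from `query` to its k nearest embeddings of normal clips."""
    dists = sorted(cosine_distance(query, ref) for ref in normal_bank)
    return sum(dists[:k]) / k

# Toy example: a memory bank of embeddings from normal training clips.
normal_bank = [[1.0, 0.0], [0.9, 0.1]]
normal_clip = mean_pool([[1.0, 0.0], [0.8, 0.2]])     # resembles the bank
anomalous_clip = mean_pool([[0.0, 1.0], [0.2, 0.8]])  # points elsewhere

score_normal = knn_anomaly_score(normal_clip, normal_bank)
score_anomalous = knn_anomaly_score(anomalous_clip, normal_bank)
print(score_normal < score_anomalous)
```

A higher score means the clip lies farther from every known normal clip, so thresholding the score gives a normal/anomalous decision.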
### Experimental results
On both the DCASE 2020 and DCASE 2023 ASD datasets, AnoPatch achieves state-of-the-art performance, supporting its effectiveness for machine anomalous sound detection.
### Summary
The main contributions of this paper are:
- Proposing AnoPatch, which achieves a significant performance improvement on machine anomalous sound detection by improving the consistency between the pre-trained model and machine audio.
- Comparing multiple pre-trained models and showing empirically that better consistency yields considerable improvement on this task.
### Formula
The ArcFace loss used during fine-tuning is:
\[
L = -\frac{1}{N} \sum_{i = 1}^{N} \log \frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j = 1, j\neq y_i}^{c}e^{s\cos\theta_j}}
\]
where:
- \( N \) is the number of samples,
- \( c \) is the number of classes,
- \( s \) (scale) and \( m \) (angular margin) are hyper-parameters,
- \( \theta_j=\arccos\left(\frac{W_j^T x_i}{\|W_j\|_2\|x_i\|_2}\right) \),
- \( W_j \) is the \( j \)-th column of the classification head weight matrix,
- \( x_i \) is the final output of sample \( i \).
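To make the formula concrete, here is a small pure-Python sketch of the same loss. It is an illustration under stated assumptions, not the authors' code: the function names `cosine` and `arcface_loss` are hypothetical, `class_weights` holds one vector \( W_j \) per class, and the default values of `s` and `m` are placeholders rather than the settings used in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def arcface_loss(embeddings, labels, class_weights, s=30.0, m=0.5):
    """Mean ArcFace loss over a batch, following the formula above.

    embeddings    : list of x_i vectors
    labels        : list of class indices y_i
    class_weights : list of W_j vectors, one per class
    s, m          : scale and angular margin (illustrative defaults)
    """
    total = 0.0
    for x, y in zip(embeddings, labels):
        cosines = [cosine(w, x) for w in class_weights]
        # Clamp before arccos to guard against floating-point overshoot.
        theta_y = math.acos(max(-1.0, min(1.0, cosines[y])))
        logits = [s * c for c in cosines]
        # Only the target class gets the additive angular margin m.
        logits[y] = s * math.cos(theta_y + m)
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[y]
    return total / len(embeddings)

# Toy check: one sample perfectly aligned with its own class vector.
weights = [[1.0, 0.0], [0.0, 1.0]]
loss_no_margin = arcface_loss([[1.0, 0.0]], [0], weights, s=1.0, m=0.0)
loss_margin = arcface_loss([[1.0, 0.0]], [0], weights, s=1.0, m=0.5)
print(loss_no_margin < loss_margin)
```

With `m = 0` the expression reduces to ordinary softmax cross-entropy on scaled cosines. A positive margin lowers the target logit, so the model must pull each sample closer to its class vector to reach the same loss.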
Together, the AudioSet-pre-trained ViT backbone, metadata classification with the ArcFace loss, and KNN scoring underlie AnoPatch's strong results on machine anomalous sound detection.