Abstract: Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machine anomalous sound detection (ASD) task. This may be caused by the inconsistency of the pre-trained model and the inductive bias of machine audio, resulting in inconsistency in data and architecture. Thus, we propose AnoPatch which utilizes a ViT backbone pre-trained on AudioSet and fine-tunes it on machine audio. It is believed that machine audio is more related to audio datasets than speech datasets, and modeling it from patch level suits the sparsity of machine audio. As a result, AnoPatch showcases state-of-the-art (SOTA) performances on the DCASE 2020 ASD dataset and the DCASE 2023 ASD dataset. We also compare multiple pre-trained models and empirically demonstrate that better consistency yields considerable improvement.

What problem does this paper attempt to address?

The paper addresses a consistency problem in the machine anomalous sound detection (ASD) task. Although large-scale pre-trained models have shown excellent performance in many fields, few works report satisfactory results with them on machine ASD. This may be due to a mismatch between the pre-trained model and the inductive bias of machine audio, which results in inconsistency in both data and architecture.

### Main problems

1. **Model consistency problem**: Existing pre-trained models perform poorly when applied to machine ASD, mainly because the pre-training stage and the fine-tuning stage are inconsistent.
2. **Inconsistency in data and architecture**: Machine audio has different characteristics from speech (for example, it is sparser and more stable), so pre-trained models designed for speech are ineffective when used directly.

### Solutions

To address these problems, the authors propose the AnoPatch model. Its core ideas are:

- **ViT backbone network**: Use a ViT (Vision Transformer) as the backbone to extract patch-level representations from mel-spectrograms.
- **Pre-training and fine-tuning**: Initialize the ViT backbone from BEATs, an audio classification model pre-trained on AudioSet, and fine-tune it on machine audio.
- **Enhanced fine-tuning tasks**: Strengthen fine-tuning by classifying machine-related metadata (such as machine type, machine ID, and speed), with the ArcFace loss used to further improve performance.
- **Anomaly detection**: At detection time, combine all patch-level representations into a single general embedding, then apply the KNN algorithm for anomaly detection.
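The detection stage described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how patch representations are combined or which distance KNN uses, so mean pooling, cosine distance, and the helper names `mean_pool` and `knn_anomaly_score` are all assumptions made here for illustration.

```python
import math

def mean_pool(patch_embeddings):
    """Average patch-level vectors into one clip-level embedding (assumed pooling)."""
    n = len(patch_embeddings)
    dim = len(patch_embeddings[0])
    return [sum(p[d] for p in patch_embeddings) / n for d in range(dim)]

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_anomaly_score(query, normal_bank, k=1):
    """Mean distance from `query` to its k nearest embeddings of normal clips."""
    distances = sorted(cosine_distance(query, ref) for ref in normal_bank)
    return sum(distances[:k]) / k

# Toy 2-dimensional patch embeddings; a real backbone would emit many patches
# with hundreds of dimensions each.
normal_clips = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.8, 0.2], [1.0, 0.0]],
]
normal_bank = [mean_pool(clip) for clip in normal_clips]

normal_test = mean_pool([[0.9, 0.1], [1.0, 0.0]])
anomalous_test = mean_pool([[0.0, 1.0], [0.1, 0.9]])

score_normal = knn_anomaly_score(normal_test, normal_bank, k=1)
score_anomalous = knn_anomaly_score(anomalous_test, normal_bank, k=1)
print(score_normal, score_anomalous)
```

A clip whose embedding lies far from every stored normal embedding receives a high score, which is why only normal recordings are needed to build the reference bank.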
### Experimental results

In experiments on two datasets, DCASE 2020 and DCASE 2023, AnoPatch achieves state-of-the-art performance, demonstrating its effectiveness on the machine ASD task.

### Summary

The main contributions of this paper are:

- Proposing the AnoPatch model, which achieves a significant performance improvement on the machine ASD task by improving the consistency of the pre-trained model.
- Showing empirically that better consistency can significantly improve the performance of pre-trained models on the machine ASD task.

### Formula display

The ArcFace loss function used during fine-tuning is:

\[ L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1, j\neq y_i}^{c} e^{s\cos\theta_j}} \]

where:

- \( N \) is the number of samples,
- \( c \) is the number of classes and \( y_i \) is the class label of sample \( i \),
- \( s \) (scale) and \( m \) (additive angular margin) are hyper-parameters,
- \( \theta_j=\arccos\left(\frac{W_j^T x_i}{\|W_j\|_2\|x_i\|_2}\right) \),
- \( W_j \) is the \( j \)-th column of the classification head weight matrix,
- \( x_i \) is the final output of sample \( i \).

Together, these design choices underpin AnoPatch's performance on the machine ASD task.
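The ArcFace formula above can be checked numerically with a small sketch in plain Python. The function name `arcface_loss`, the toy embeddings and weights, and the values `s=8.0` and `m=0.5` are illustrative assumptions, not settings reported in the source.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def arcface_loss(embeddings, labels, class_weights, s, m):
    """Average ArcFace loss; class_weights[j] plays the role of W_j."""
    total = 0.0
    for x, y in zip(embeddings, labels):
        logits = []
        for j, w in enumerate(class_weights):
            cos_theta = max(-1.0, min(1.0, cosine(w, x)))  # clamp for acos
            if j == y:
                # The margin m is added to the angle of the true class only.
                logits.append(s * math.cos(math.acos(cos_theta) + m))
            else:
                logits.append(s * cos_theta)
        # Numerically stable log-sum-exp over all class logits.
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
        total += log_sum - logits[y]
    return total / len(embeddings)

# Toy setup: two classes, two samples (values are illustrative only).
class_weights = [[1.0, 0.0], [0.0, 1.0]]
embeddings = [[0.9, 0.1], [0.2, 0.8]]
labels = [0, 1]

loss_with_margin = arcface_loss(embeddings, labels, class_weights, s=8.0, m=0.5)
loss_without_margin = arcface_loss(embeddings, labels, class_weights, s=8.0, m=0.0)
print(loss_with_margin, loss_without_margin)
```

With `m=0.0` the expression reduces to a scaled, normalized softmax cross-entropy. A positive margin lowers the true-class logit, so the same embeddings incur a larger loss and must be pulled closer to their class weight vector to compensate.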

AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection
