Deep Generic Representations for Domain-Generalized Anomalous Sound Detection

Phurich Saengthong, Takahiro Shinozaki
DOI: https://doi.org/10.48550/arXiv.2409.05035
2024-09-08
Abstract: Developing a reliable anomalous sound detection (ASD) system requires robustness to noise, adaptation to domain shifts, and effective performance with limited training data. Current leading methods rely on extensive labeled data for each target machine type to train feature extractors using Outlier-Exposure (OE) techniques, yet their performance on the target domain remains sub-optimal. In this paper, we present GenRep, which utilizes generic feature representations from a robust, large-scale pre-trained feature extractor combined with kNN for domain-generalized ASD, without the need for fine-tuning. GenRep incorporates MemMixup, a simple approach for augmenting the target memory bank using nearest source samples, paired with a domain normalization technique to address the imbalance between source and target domains. GenRep outperforms the best OE-based approach without a need for labeled data with an Official Score of 73.79% on the DCASE2023T2 Eval set and demonstrates robustness under limited data scenarios. The code is available open-source.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
This paper aims to develop a reliable anomalous sound detection (ASD) system that meets three requirements:

1. **Noise Robustness**: The system must be robust to noise in the input.
2. **Domain-Shift Adaptability**: The system must adapt to domain shifts (for example, changes in machine state caused by temperature or background noise).
3. **Effective Performance with Limited Training Data**: On newly installed devices, the system must work effectively with limited training data.

Current leading methods train feature extractors with the Outlier-Exposure (OE) technique, which relies on a large amount of labeled data for each target machine type. However, the performance of these methods in the target domain is still unsatisfactory. Specifically, existing methods face two main challenges:

- Sufficient normal data is required for training in both the source and target domains, and hyper-parameters must be carefully tuned to prevent over-fitting and to avoid capturing irrelevant noise.
- A large amount of labeled data is required, which may be impractical or infeasible in real-world applications.

To solve these problems, the authors propose GenRep, which uses a large-scale pre-trained feature extractor to generate generic feature representations and combines them with kNN for domain-generalized anomalous sound detection, without fine-tuning. GenRep improves robustness and performance by introducing MemMixup and Domain Normalization (DN) to address the imbalance between the source and target domains.

### Specific Problems and Solutions

1. **How to obtain robust feature representations without a large amount of labeled data?**
   - GenRep uses large-scale pre-trained audio models such as BEATs to generate generic feature representations without additional fine-tuning.
2. **How to deal with the imbalance between the source and target domains?**
   - GenRep introduces the MemMixup method, which augments the target memory bank by interpolating each target feature with its nearest source features, thereby balancing the feature distribution between the source and target domains.
3. **How to maintain performance under domain shift?**
   - GenRep applies Domain Normalization (DN), which standardizes the score distributions of the different domains through Z-score normalization, thereby reducing the impact of domain shift.

Through these methods, GenRep achieved an Official Score of 73.79% on the DCASE2023T2 evaluation set and demonstrated strong robustness in limited-data scenarios.
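The three components described above (kNN scoring against memory banks, MemMixup, and Domain Normalization) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: features are assumed to be already extracted and pooled into fixed-length vectors (for example by a pre-trained model such as BEATs), and the function names, the mixing weight `lam`, the neighbor counts, the leave-one-out statistics, and the final `min` over domains are all hypothetical choices made for this example.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def knn_score(queries, bank, k=1, exclude_self=False):
    """Mean distance from each query to its k nearest neighbours in bank."""
    d = np.sort(pairwise_dist(queries, bank), axis=1)
    if exclude_self:
        d = d[:, 1:]  # queries are the bank itself: drop the zero self-distance
    return d[:, :k].mean(axis=1)

def mem_mixup(target, source, n_neighbors=2, lam=0.9):
    """Augment the target bank by interpolating each target feature
    toward its nearest source features (lam is an assumed mixing weight)."""
    idx = np.argsort(pairwise_dist(target, source), axis=1)[:, :n_neighbors]
    mixed = lam * target[:, None, :] + (1.0 - lam) * source[idx]
    return np.concatenate([target, mixed.reshape(-1, target.shape[1])], axis=0)

def loo_stats(bank, k=1):
    """Mean and std of leave-one-out kNN scores inside one domain's bank."""
    s = knn_score(bank, bank, k=k, exclude_self=True)
    return s.mean(), s.std() + 1e-8

def anomaly_score(queries, source, target, k=1, n_neighbors=2, lam=0.9):
    """Z-score normalize the kNN score per domain, then keep the smaller one.
    Assumption: statistics come from the un-augmented banks, and a query
    counts as normal if it looks normal in either domain."""
    target_aug = mem_mixup(target, source, n_neighbors, lam)
    mu_s, sd_s = loo_stats(source, k)
    mu_t, sd_t = loo_stats(target, k)
    z_s = (knn_score(queries, source, k) - mu_s) / sd_s
    z_t = (knn_score(queries, target_aug, k) - mu_t) / sd_t
    return np.minimum(z_s, z_t)
```

In this sketch, a large source bank and a small target bank are passed to `anomaly_score` together with the query features; higher scores indicate more anomalous clips. Normalizing each domain's scores separately keeps the densely populated source bank from dominating the comparison with the sparse target bank.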