Abstract: Supervised learning methods have shown effectiveness in estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they still suffer from the simulation-to-reality generalization problem, owing to the mismatch between simulated and real-world acoustic characteristics and the scarcity of annotated real-world data. To this end, this work proposes a self-supervised method that takes full advantage of unlabeled data for spatial acoustic parameter estimation. First, a new pretext task, i.e., cross-channel signal reconstruction (CCSR), is designed to learn a universal spatial acoustic representation from unlabeled multi-channel microphone signals. We mask part of the signal in one channel and ask the model to reconstruct it, which makes it possible to learn spatial acoustic information from the unmasked signals and to extract source information from the other microphone channel. An encoder-decoder structure is used to disentangle the two kinds of information. By fine-tuning the pre-trained spatial encoder with a small annotated dataset, this encoder can be used to estimate spatial acoustic parameters. Second, a novel multi-channel audio Conformer (MC-Conformer) is adopted as the encoder model architecture, which is suitable for both the pretext and downstream tasks. It is carefully designed to capture the local and global characteristics of spatial acoustics exhibited in the time-frequency domain. Experimental results on five acoustic parameter estimation tasks, using both simulated and real-world data, show the effectiveness of the proposed method. To the best of our knowledge, this is the first self-supervised learning method in the field of spatial acoustic representation learning and multi-channel audio signal processing.
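The masking step of the pretext task can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the array layout (channels, frames, frequency bins), the zero-filling of masked frames, the 50% mask ratio, the mean-squared-error loss, and the function names `mask_one_channel` and `masked_reconstruction_loss` are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def mask_one_channel(spec, mask_ratio=0.5, channel=0, seed=0):
    """Zero out a random subset of time frames in one channel.

    spec: array of shape (channels, frames, freq_bins); this layout is an assumption.
    Returns the masked copy and a boolean mask over frames (True = masked).
    """
    rng = np.random.default_rng(seed)
    n_frames = spec.shape[1]
    n_masked = int(round(mask_ratio * n_frames))
    # Pick distinct frame indices to hide from the model.
    masked_idx = rng.choice(n_frames, size=n_masked, replace=False)
    frame_mask = np.zeros(n_frames, dtype=bool)
    frame_mask[masked_idx] = True
    masked = spec.copy()
    # Only the chosen channel is masked; the other channel stays intact.
    masked[channel, frame_mask, :] = 0.0
    return masked, frame_mask


def masked_reconstruction_loss(pred, target, frame_mask):
    """Mean squared error over masked frames only (assumed loss).

    pred, target: arrays of shape (frames, freq_bins) for the masked channel.
    """
    diff = pred[frame_mask] - target[frame_mask]
    return float(np.mean(np.abs(diff) ** 2))


# Toy two-channel "spectrogram" standing in for real microphone signals.
spec = np.random.default_rng(1).standard_normal((2, 100, 257))
masked, frame_mask = mask_one_channel(spec, mask_ratio=0.5, channel=0)
```

In this sketch, a model would receive `masked` as input and be trained to predict channel 0 of `spec` on the frames where `frame_mask` is true, with the intact channel 1 available as a source reference.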

Improving Acoustic Scene Classification Via Self-Supervised and Semi-Supervised Learning with Efficient Audio Transformer

Semi-Supervised Acoustic Scene Classification with Test-Time Adaptation

Leveraging Self-supervised Audio Representations for Data-Efficient Acoustic Scene Classification

Description on IEEE ICME 2024 Grand Challenge: Semi-supervised Acoustic Scene Classification under Domain Shift

The NERCSLIP-USTC System for Semi-Supervised Acoustic Scene Classification of ICME 2024 Grand Challenge

EAT: Self-Supervised Pre-Training with Efficient Audio Transformer

Domain Adaptation Transformer for Unsupervised Driving-Scene Segmentation in Adverse Conditions

An Investigation of Transfer Learning Mechanism for Acoustic Scene Classification

Multi-level distance embedding learning for robust acoustic scene classification with unseen devices

A Four-Stage Data Augmentation Approach to ResNet-Conformer Based Acoustic Modeling for Sound Event Localization and Detection

Long-term scalogram integrated with an iterative data augmentation scheme for acoustic scene classification

Deep semantic learning for acoustic scene classification

Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks

A Hybrid Approach to Acoustic Scene Classification Based on Universal Acoustic Models

Data Efficient Acoustic Scene Classification using Teacher-Informed Confusing Class Instruction

Semi-Supervised Active Learning for Sound Classification in Hybrid Learning Environments

A Simple Fusion of Deep and Shallow Learning for Acoustic Scene Classification

Hierarchical classification for acoustic scenes using deep learning

Data-Efficient Low-Complexity Acoustic Scene Classification in the DCASE 2024 Challenge

Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer

Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling