Abstract: Supervised learning methods have proven effective for estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they still suffer from the simulation-to-reality generalization problem, caused by the mismatch between simulated and real-world acoustic characteristics and by the scarcity of annotated real-world data. To address this, this work proposes a self-supervised method that takes full advantage of unlabeled data for spatial acoustic parameter estimation. First, a new pretext task, cross-channel signal reconstruction (CCSR), is designed to learn a universal spatial acoustic representation from unlabeled multi-channel microphone signals. We mask part of the signal of one channel and ask the model to reconstruct it, which makes it possible to learn spatial acoustic information from the unmasked signals and to extract source information from the other microphone channel. An encoder-decoder structure is used to disentangle the two kinds of information. After the pre-trained spatial encoder is fine-tuned on a small annotated dataset, it can be used to estimate spatial acoustic parameters. Second, a novel multi-channel audio Conformer (MC-Conformer) is adopted as the encoder architecture, which is suitable for both the pretext and downstream tasks. It is designed to capture the local and global characteristics of spatial acoustics exhibited in the time-frequency domain. Experimental results on five acoustic parameter estimation tasks, using both simulated and real-world data, show the effectiveness of the proposed method. To the best of our knowledge, this is the first self-supervised learning method in the field of spatial acoustic representation learning and multi-channel audio signal processing.
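The masking step of the CCSR pretext task can be illustrated with a minimal sketch. The abstract does not specify the signal representation, the masking granularity, or the loss, so everything below is an assumption made for illustration: a complex spectrogram of shape (channels, frequency bins, frames), masking of whole time frames by zeroing, and a mean squared error computed only on the masked frames. The names `mask_one_channel` and `reconstruction_loss` are hypothetical and do not come from the paper.

```python
import numpy as np


def mask_one_channel(spec, channel=0, mask_ratio=0.5, rng=None):
    """Zero out a random subset of time frames in one channel.

    spec: complex array of shape (channels, freq_bins, frames) (assumed layout).
    Returns the masked copy and a boolean vector marking the masked frames.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_frames = spec.shape[-1]
    # Mask at least one frame so the loss below is always defined.
    n_masked = max(1, int(round(mask_ratio * n_frames)))
    masked_frames = np.zeros(n_frames, dtype=bool)
    masked_frames[rng.choice(n_frames, size=n_masked, replace=False)] = True

    masked = spec.copy()
    # Only the chosen channel is masked; the other channel stays intact
    # and can serve as the reference for source information.
    masked[channel][:, masked_frames] = 0
    return masked, masked_frames


def reconstruction_loss(pred, target, masked_frames):
    """Mean squared error restricted to the masked frames of one channel.

    pred, target: complex arrays of shape (freq_bins, frames).
    """
    diff = pred[:, masked_frames] - target[:, masked_frames]
    return float(np.mean(np.abs(diff) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-in for a two-channel spectrogram (not real audio).
    spec = rng.standard_normal((2, 4, 10)) + 1j * rng.standard_normal((2, 4, 10))
    masked, frames = mask_one_channel(spec, channel=0, mask_ratio=0.5, rng=rng)
    print("masked frames:", int(frames.sum()))
    print("loss if the model outputs the masked input:",
          reconstruction_loss(masked[0], spec[0], frames))
```

In a real training loop, an encoder-decoder model would take `masked` as input and its output for the masked channel would replace `masked[0]` in the loss call; the model itself is outside the scope of this sketch.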

Multi-Speaker Pitch Tracking Via Embodied Self-Supervised Learning

VarASV: Enabling Pitch-variable Automatic Speaker Verification Via Multi-task Learning

Learning Virtual HD Model for Bi-model Emotional Speaker Recognition

Toward Pitch-Insensitive Speaker Verification Via Soundfield

Toward Fully Self-Supervised Multi-Pitch Estimation

Single-channel speech separation integrating pitch information based on a multi task learning framework

Embodied Self-supervised Learning by Coordinated Sampling and Training

Robust Multipitch Estimation Of Piano Sounds Using Deep Spiking Neural Networks

An initial research: Towards accurate pitch extraction for speech synthesis based on BLSTM

Multi-Pitch Detection for Co-Channel Speech Utilizing Frequency Channel Piecewise Integration and Morphological Feedback Verification Tracking

Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer

Unsupervised Inference of Physiologically Meaningful Articulatory Trajectories with VocalTractLab

Adapting Self-Supervised Models to Multi-Talker Speech Recognition Using Speaker Embeddings

PESTO: Pitch Estimation with Self-supervised Transposition-equivariant Objective

PGSS: Pitch-Guided Speech Separation

Self-Supervised Models of Speech Infer Universal Articulatory Kinematics

Improving Multimodal Speech Enhancement by Incorporating Self-Supervised and Curriculum Learning

Multi-Task Joint Learning for Embedding Aware Audio-Visual Speech Enhancement

Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings

Audio Mixing Inversion Via Embodied Self-supervised Learning