Abstract: Self-supervised learning (SSL) has garnered significant attention in speech processing, excelling in linguistic tasks such as speech recognition. However, jointly improving the performance of pre-trained models on various downstream tasks, each of which requires different speech information, remains challenging. To this end, we propose a self-supervised learning method based on progressive residual extraction, named ProgRE. Specifically, we introduce two lightweight, specialized task modules into an encoder-style SSL backbone to enhance its ability to extract pitch variation and speaker information from speech. Furthermore, to prevent the reinforced pitch variation and speaker information from interfering with the learning of content information, to which they are irrelevant, we residually remove the information extracted by these two modules from the main branch. The main branch is then trained with HuBERT's masked speech prediction to ensure that the Transformer's deep-layer features perform well on content tasks. In this way, we progressively extract pitch variation, speaker, and content representations from the input speech. Finally, we combine these representations, which carry diverse speech information, using different layer weights to obtain task-specific representations for various downstream tasks. Experimental results indicate that the proposed method achieves joint performance improvements on tasks such as speaker identification, speech recognition, emotion recognition, speech enhancement, and voice conversion, compared to strong SSL methods such as wav2vec 2.0, HuBERT, and WavLM.
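
The two ideas in the abstract, residual removal and layer-weighted combination, can be illustrated with a toy NumPy sketch. This is not the ProgRE implementation: the function names, the use of plain linear projections as stand-ins for the pitch and speaker modules, and the softmax-normalized layer weights are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def progressive_residual_extract(frames, pitch_proj, speaker_proj):
    """Toy residual extraction (hypothetical, not the paper's modules).

    frames:       (T, D) frame-level features from a backbone
    pitch_proj:   (D, D) stand-in for the pitch-variation module
    speaker_proj: (D, D) stand-in for the speaker module
    """
    pitch = frames @ pitch_proj        # information captured by module 1
    residual = frames - pitch          # remove it from the main branch
    speaker = residual @ speaker_proj  # information captured by module 2
    content = residual - speaker       # what is left for content learning
    return pitch, speaker, content

def combine_layers(representations, layer_logits):
    """Weighted sum of same-shaped representations with softmax weights."""
    weights = softmax(np.asarray(layer_logits, dtype=float))
    stacked = np.stack(representations)            # (L, T, D)
    return np.tensordot(weights, stacked, axes=1)  # (T, D)
```

In this sketch a downstream task would learn its own `layer_logits`; for example, a speaker task could end up weighting the speaker representation more heavily than the content one.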

Improving Speech Separation with Knowledge Distilled from Self-supervised Pre-trained Models

Investigating Self-Supervised Learning for Speech Enhancement and Separation

Improving Generalizability of Distilled Self-Supervised Speech Processing Models Under Distorted Settings

Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition

Speech Separation with Pretrained Frontend to Minimize Domain Mismatch

One-Step Knowledge Distillation and Fine-Tuning in Using Large Pre-Trained Self-Supervised Learning Models for Speaker Verification

SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning

UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training

Target Speech Extraction with Pre-trained Self-supervised Learning Models

Feature Learning and Ensemble Pre-Tasks Based Self-Supervised Speech Denoising and Dereverberation

Improving Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision

Progressive Residual Extraction based Pre-training for Speech Representation Learning

Adapting Self-Supervised Models to Multi-Talker Speech Recognition Using Speaker Embeddings

Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition

Self-Supervised Learning-Based Source Separation for Meeting Data

Why Does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Universal Sound Separation with Self-Supervised Audio Masked Autoencoder

Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning

Separate And Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation

Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection