Abstract: Speech representation learning with self-supervised algorithms has yielded notable performance gains in many downstream tasks. Recent work has combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms for representation learning. Joint training with SSL and VGS mechanisms makes it possible to use both unlabeled speech and speech-related visual information, depending on data availability. This has been shown to enhance the quality of the learned representations, especially in encoding semantic- and lexical-level knowledge. In this work, we further study the joint optimization of wav2vec 2.0-based SSL and transformer-based VGS as a multi-task learning system. We explore a set of training scenarios to understand how speech representations are shared or transferred between the two tasks, and which training strategy is optimal for cross-modal semantic retrieval and phoneme discrimination performance. We find that sequential training, with wav2vec 2.0 first and VGS next, yields higher audio-visual retrieval performance than simultaneous optimization of both learning mechanisms. However, parallel SSL-VGS training reduces the effects of catastrophic forgetting when switching between optimization criteria. Moreover, the results suggest that phonemic representations learned through the VGS mechanism may generalize better across datasets than those learned with SSL.

Simultaneous or Sequential Training? How Speech Representations Cooperate in a Multi-Task Self-Supervised Learning System
