Abstract: The excellent generalization ability of self-supervised learning (SSL) for speech foundation models has garnered significant attention. HuBERT is a successful example that utilizes offline clustering to convert speech features into discrete units for a masked language modeling pretext task. However, simply clustering features as targets by k-means does not fully inspire the model's performance. In this work, we present an unsupervised method to improve SSL targets. Two models are proposed, MonoBERT and PolyBERT, which leverage context-independent and context-dependent phoneme-based units for pre-training. Our models outperform other SSL models significantly on the LibriSpeech benchmark without the need for iterative re-clustering and re-training. Furthermore, our models equipped with context-dependent units even outperform target-improvement models that use labeled data during pre-training. How we progressively improve the unit discovery process is demonstrated through experiments.

What problem does this paper attempt to address?

The paper addresses how to improve the quality of the pre-training targets in self-supervised learning (SSL) for speech representation, with the goal of improving the model's generalization and its performance on downstream tasks. It proposes two models, MonoBERT and PolyBERT, that both derive their targets from phoneme information, whereas the original HuBERT model generates targets through offline clustering of speech features.

1. **MonoBERT**: Uses frame-level monophone pseudo-units generated by a modified version of wav2vec-U 2.0 as the targets for self-supervised learning. This change alone yields gains over HuBERT on automatic speech recognition (ASR) tasks.
2. **PolyBERT**: Uses context-aware phoneme-based pseudo-units for pre-training. The authors explored four methods for generating context-aware units: logical triphones, physical triphones, phoneme segments, and phoneme clustering. Among these, the physical triphone method (PolyBERT-PT) significantly outperformed the baseline models HuBERT and MonoBERT.

The paper validates the proposed methods experimentally, reporting significant improvements not only on ASR tasks but also on non-ASR tasks such as speaker identification, keyword recognition, intent classification, and emotion recognition. This indicates that the proposed models produce speech representations with broad applicability.
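To make the difference between context-independent and context-dependent targets concrete, the following is a minimal sketch, not the authors' implementation. It assumes frame-level monophone labels are already available (in the paper these come from a modified wav2vec-U 2.0) and relabels each frame with a `left-center+right` string, which is one common way to write a logical triphone. The function names, the `sil` boundary symbol, and the label format are all hypothetical choices made for illustration. Physical triphones would additionally require tying similar contexts together, which is not shown here.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def collapse_frames(frame_phones: List[str]) -> List[Tuple[str, int]]:
    """Merge runs of identical frame labels into (phone, n_frames) segments."""
    return [(phone, len(list(run))) for phone, run in groupby(frame_phones)]


def logical_triphone_targets(frame_phones: List[str], boundary: str = "sil") -> List[str]:
    """Relabel every frame with a 'left-center+right' context-dependent unit.

    `boundary` is an assumed placeholder for the missing context at the
    start and end of the utterance.
    """
    segments = collapse_frames(frame_phones)
    phones = [phone for phone, _ in segments]
    targets: List[str] = []
    for i, (phone, n_frames) in enumerate(segments):
        left = phones[i - 1] if i > 0 else boundary
        right = phones[i + 1] if i + 1 < len(phones) else boundary
        # Every frame of the segment shares the same context-dependent label.
        targets.extend([f"{left}-{phone}+{right}"] * n_frames)
    return targets


def build_unit_vocab(targets: List[str]) -> Dict[str, int]:
    """Assign an integer id to each distinct unit, in first-seen order."""
    vocab: Dict[str, int] = {}
    for unit in targets:
        vocab.setdefault(unit, len(vocab))
    return vocab


# Toy frame-level monophone sequence (context-independent targets).
frames = ["k", "k", "ae", "ae", "ae", "t"]
mono_targets = frames
# Context-dependent targets derived from the same frames.
poly_targets = logical_triphone_targets(frames)
poly_ids = [build_unit_vocab(poly_targets)[u] for u in poly_targets]
```

One limitation of this sketch is that collapsing runs of identical labels cannot distinguish a long phone from the same phone occurring twice in a row. A real pipeline would use explicit segment boundaries instead.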

Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation

k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning

Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition

Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction

Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning

Investigating Self-Supervised Learning for Speech Enhancement and Separation

SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT

Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations

DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models

Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute

An Empirical Analysis of Speech Self-Supervised Learning at Multiple Resolutions

Silence is Sweeter Than Speech: Self-Supervised Model Using Silence to Store Speaker Information

Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models

W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training

Speech Representation Learning Revisited: The Necessity of Separate Learnable Parameters and Robust Data Augmentation

SUPERB: Speech Processing Universal PERformance Benchmark

Improved Speech Pre-Training with Supervision-Enhanced Acoustic Unit

Self-Supervised Models of Speech Infer Universal Articulatory Kinematics

Progressive Residual Extraction based Pre-training for Speech Representation Learning

Towards Robust Speech Representation Learning for Thousands of Languages

Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy Reduction