Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation

Ziyang Ma, Zhisheng Zheng, Guanrou Yang, Yu Wang, Chao Zhang, Xie Chen
2023-06-15
Abstract: The excellent generalization ability of self-supervised learning (SSL) for speech foundation models has garnered significant attention. HuBERT is a successful example that utilizes offline clustering to convert speech features into discrete units for a masked language modeling pretext task. However, simply clustering features as targets by k-means does not fully inspire the model's performance. In this work, we present an unsupervised method to improve SSL targets. Two models are proposed, MonoBERT and PolyBERT, which leverage context-independent and context-dependent phoneme-based units for pre-training. Our models outperform other SSL models significantly on the LibriSpeech benchmark without the need for iterative re-clustering and re-training. Furthermore, our models equipped with context-dependent units even outperform target-improvement models that use labeled data during pre-training. How we progressively improve the unit discovery process is demonstrated through experiments.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper asks how to improve the quality of pre-training targets in self-supervised learning (SSL) for speech representation, so that the model generalizes better and performs better on downstream tasks. HuBERT derives its targets by offline k-means clustering of speech features. The paper instead proposes two models, MonoBERT and PolyBERT, whose targets are phoneme-based units discovered without supervision.

1. **MonoBERT**: Uses frame-level monophone pseudo-units, generated by a modified version of wav2vec-U 2.0, as the SSL targets. This change alone yields gains over HuBERT on automatic speech recognition (ASR).
2. **PolyBERT**: Uses context-dependent phoneme-based pseudo-units for pre-training. The authors explore four ways to generate such units: logical triphones, physical triphones, phoneme segments, and phoneme clustering. Of these, the physical triphone variant (PolyBERT-PT) significantly outperforms both the HuBERT and MonoBERT baselines.

The paper validates the proposed methods experimentally, reporting improvements on ASR and also on non-ASR tasks such as speaker identification, keyword recognition, intent classification, and emotion recognition. This suggests that the learned speech representations are broadly applicable.
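
To make the difference between context-independent and context-dependent units concrete, the sketch below turns a frame-level monophone sequence into logical-triphone labels. It illustrates the general notion of a logical triphone only and is not the authors' pipeline: the function name `frames_to_logical_triphones`, the `left-center+right` label format (borrowed from common HMM toolkit conventions), the boundary symbol, and the toy input are all assumptions made for this example.

```python
from itertools import groupby


def frames_to_logical_triphones(frame_units, boundary="<b>"):
    """Relabel frame-level monophone units with their left/right context.

    Hypothetical helper for illustration; the paper's actual unit
    generation procedure is not reproduced here.
    """
    # Collapse runs of identical frame labels into (unit, run_length) segments.
    segments = [(unit, len(list(run))) for unit, run in groupby(frame_units)]
    units = [unit for unit, _ in segments]

    triphone_frames = []
    for i, (unit, length) in enumerate(segments):
        # Neighbouring segments supply the context; use a boundary symbol at the edges.
        left = units[i - 1] if i > 0 else boundary
        right = units[i + 1] if i + 1 < len(units) else boundary
        # Every frame of the segment receives the same context-dependent label.
        triphone_frames.extend([f"{left}-{unit}+{right}"] * length)
    return triphone_frames


# Toy frame-level monophone pseudo-units (assumed input, one label per frame).
frames = ["h", "h", "e", "e", "e", "l", "o"]
triphones = frames_to_logical_triphones(frames)
print(triphones)
```

The output has one label per input frame, so it could serve as a drop-in replacement for monophone targets in a masked-prediction setup. The label inventory grows quickly with context, which is the usual motivation for tying logical triphones into a smaller set of physical ones.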