Abstract: When labeled data is insufficient, semi-supervised learning with pseudo-labeling can significantly improve the performance of automatic speech recognition. However, pseudo-labels are often noisy and contain numerous incorrect tokens. Treating noisy labels as ground truth in the loss function results in suboptimal performance. Previous work has attempted to mitigate this issue by either filtering out the noisiest pseudo-labels or improving the overall quality of pseudo-labels. While these methods are effective to some extent, it is unrealistic to eliminate incorrect tokens from pseudo-labels entirely. In this work, we propose a novel framework named alternative pseudo-labeling to tackle the issue of noisy pseudo-labels from the perspective of the training objective. The framework comprises three components. First, a generalized CTC loss function is introduced to handle noisy pseudo-labels by accepting alternative tokens in the positions of incorrect tokens. Applying this loss function in pseudo-labeling requires detecting incorrect tokens in the predicted pseudo-labels. We adopt a confidence-based error detection method that identifies incorrect tokens by comparing their confidence scores with a given threshold, which requires the confidence scores to be discriminative. Hence, the second component is a contrastive CTC loss function that widens the confidence gap between correctly and incorrectly predicted tokens, thereby improving error detection. Finally, obtaining satisfactory performance with confidence-based error detection typically requires extensive threshold tuning. Instead, we propose an automatic thresholding method that uses labeled data as a proxy for determining the threshold, thus avoiding manual tuning.
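The confidence-based error detection and automatic thresholding described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `detect_errors` and `pick_threshold`, the toy confidence values, and the selection criterion (maximizing error-detection accuracy on labeled data) are all assumptions made for illustration, since the abstract does not specify how the threshold is derived from the labeled data.

```python
from typing import List, Sequence


def detect_errors(confidences: Sequence[float], threshold: float) -> List[bool]:
    """Flag a token as suspected-incorrect when its confidence is below the threshold."""
    return [c < threshold for c in confidences]


def pick_threshold(confidences: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Pick a threshold using labeled data as a proxy.

    Assumed criterion (hypothetical, not stated in the abstract): choose the
    candidate threshold that maximizes error-detection accuracy, i.e. the number
    of tokens where "flagged as error" agrees with "actually incorrect".
    """
    if len(confidences) != len(is_correct):
        raise ValueError("confidences and is_correct must have the same length")
    if not confidences:
        raise ValueError("at least one labeled token is required")
    # Candidate thresholds: every observed confidence, plus +inf (flag everything).
    candidates = sorted(set(confidences)) + [float("inf")]
    best_threshold, best_hits = candidates[0], -1
    for t in candidates:
        flagged = detect_errors(confidences, t)
        # A hit is a token that is flagged and wrong, or unflagged and right.
        hits = sum(f != ok for f, ok in zip(flagged, is_correct))
        if hits > best_hits:
            best_threshold, best_hits = t, hits
    return best_threshold


# Toy labeled data: per-token confidences and whether each prediction matched
# the ground-truth transcript (values are made up for illustration).
labeled_conf = [0.95, 0.40, 0.88, 0.35, 0.91]
labeled_ok = [True, False, True, False, True]
threshold = pick_threshold(labeled_conf, labeled_ok)

# Apply the threshold to token confidences of an unlabeled utterance.
flags = detect_errors([0.90, 0.50, 0.88], threshold)
print(threshold, flags)
```

In this sketch, tokens flagged as suspected errors would be the positions where a generalized CTC loss accepts alternative tokens; the loss itself is not shown here.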

Iterative Pseudo-Labeling for Speech Recognition

Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels

AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition

Empowering Low-Resource Language ASR via Large-Scale Pseudo Labeling

Candidate Pseudolabel Learning: Enhancing Vision-Language Models by Prompt Tuning with Unlabeled Data

Advancing Momentum Pseudo-Labeling with Conformer and Initialization Strategy

Pseudo-Labeling for Massively Multilingual Speech Recognition

Unsupervised ASR via Cross-Lingual Pseudo-Labeling

Efficient Spoken Language Recognition via Multilabel Classification

Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization

VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model

Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition

Pseudo Label Is Better Than Human Label

Label Aware Speech Representation Learning For Language Identification

Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment

Pseudo-Phoneme Label Loss for Text-Independent Speaker Verification

Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning

Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding

Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

Improving Audio-Visual Video Parsing with Pseudo Visual Labels

Instruction-Following Speech Recognition