Self-Supervised Learning for Multi-Channel Neural Transducer

Atsushi Kojima

2024-08-06

Abstract: Self-supervised learning, such as with the wav2vec 2.0 framework, significantly improves the accuracy of end-to-end automatic speech recognition (ASR). Wav2vec 2.0 has been applied to single-channel end-to-end ASR models. In this work, we explored a self-supervised learning method for a multi-channel end-to-end ASR model based on the wav2vec 2.0 framework. As the multi-channel end-to-end ASR model, we focused on a multi-channel neural transducer. In pre-training, we compared three different methods for feature quantization to train a multi-channel conformer audio encoder: joint quantization, feature-wise quantization and channel-wise quantization. In fine-tuning, we trained the multi-channel conformer-transducer. All experiments were conducted using the far-field in-house and CHiME-4 datasets. The results of the experiments showed that feature-wise quantization was the most effective among the methods. We observed a 66% relative reduction in character error rate compared with the model without any pre-training for the far-field in-house dataset.

Computation and Language, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper explores how to apply self-supervised learning, specifically the wav2vec 2.0 framework, to multi-channel end-to-end automatic speech recognition (ASR) models. The goal is to improve recognition accuracy in far-field environments, in particular robustness under noisy conditions. The authors focus on a multi-channel neural transducer. In pre-training, they train a multi-channel Conformer audio encoder and compare three feature quantization methods: joint quantization, feature-wise quantization, and channel-wise quantization. In fine-tuning, they train the multi-channel conformer-transducer. On the far-field in-house dataset, feature-wise quantization performs best, yielding a 66% relative reduction in character error rate compared with the model without pre-training. On the publicly available CHiME-4 dataset, the relative character error rate reduction is 4.2%; the smaller gain may be due to the smaller size of the CHiME-4 dataset. In summary, self-supervised pre-training, particularly with feature-wise quantization, can substantially improve multi-channel end-to-end ASR in noisy far-field conditions.
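The "relative reduction" figures above are ratios against the baseline error rate, not differences in percentage points. A minimal sketch of that arithmetic follows; the function name `relative_reduction` and the input error rates are hypothetical, chosen only to illustrate the calculation, and are not values reported in the paper.

```python
def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Return the relative CER reduction as a fraction of the baseline CER."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical error rates (in percent), NOT taken from the paper.
baseline = 30.0    # assumed CER of a model without pre-training
pretrained = 10.2  # assumed CER of a model with pre-training

# Absolute drop is 19.8 points; relative to the 30.0 baseline that is about 0.66.
print(f"relative reduction: {relative_reduction(baseline, pretrained):.0%}")
```

Under this definition, the same relative reduction can correspond to very different absolute changes depending on how high the baseline error rate is.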

