Abstract: State-of-the-art neural network language models (NNLMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex. They are prone to overfitting and poor generalization when given limited training data. To address this, this paper proposes an overarching full Bayesian learning framework encompassing three methods to account for the underlying uncertainty in LSTM-RNN and Transformer LMs. The uncertainty over their model parameters, choice of neural activations and hidden output representations is modeled using Bayesian, Gaussian Process and variational LSTM-RNN or Transformer LMs, respectively. Two efficient inference approaches are used: neural architecture search automatically selects the optimal network internal components to be Bayesian learned, and the number of Monte Carlo parameter samples is kept minimal, as low as one. Together these minimize the computational cost incurred in Bayesian NNLM training and evaluation. Experiments are conducted on two tasks: AMI meeting transcription and Oxford-BBC LipReading Sentences 2 (LRS2) overlapped speech recognition, using state-of-the-art LF-MMI trained factored TDNN systems featuring data augmentation, speaker adaptation and audio-visual multi-channel beamforming for overlapped speech. Consistent performance improvements over the baseline LSTM-RNN and Transformer LMs with point-estimated model parameters and dropout regularization are obtained across both tasks in terms of perplexity and word error rate (WER). In particular, on the LRS2 data, statistically significant WER reductions of up to 1.3% and 1.2% absolute (12.1% and 11.3% relative) are obtained over the baseline LSTM-RNN and Transformer LMs, respectively, after model combination between the Bayesian NNLMs and their respective baselines.
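As a rough illustration of what drawing a single Monte Carlo parameter sample can look like, the sketch below applies the standard reparameterization trick to a mean-field Gaussian posterior over the weights of one linear unit. It is not taken from the paper: the function names, the standard normal prior and the pure-Python setting are all assumptions made for illustration, and the NNLMs in the abstract are far larger.

```python
import math
import random


def sample_weights(mu, log_sigma, rng):
    """Draw ONE Monte Carlo sample w = mu + sigma * eps, with eps ~ N(0, 1).

    mu and log_sigma are hypothetical variational parameters (one per weight).
    """
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]


def kl_to_standard_normal(mu, log_sigma):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over independent weights.

    The N(0, 1) prior is an assumption of this sketch.
    """
    return sum(
        0.5 * (math.exp(2.0 * s) + m * m - 1.0) - s
        for m, s in zip(mu, log_sigma)
    )


def bayesian_linear(x, mu, log_sigma, rng):
    """Output of one linear unit computed with a single sampled weight vector."""
    w = sample_weights(mu, log_sigma, rng)
    return sum(wi * xi for wi, xi in zip(w, x))


if __name__ == "__main__":
    rng = random.Random(0)
    mu = [0.5, -0.2, 0.1]
    log_sigma = [-2.0, -2.0, -2.0]
    x = [1.0, 2.0, 3.0]
    print("one-sample output:", bayesian_linear(x, mu, log_sigma, rng))
    print("KL penalty:", kl_to_standard_normal(mu, log_sigma))
```

In a training loop of this kind, the KL term would typically be added to the data likelihood loss, and the posterior mean `mu` could be used in place of a sample at evaluation time; both are common choices rather than details confirmed by the abstract.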

Tibetan Multi-Dialect Speech Recognition Using Latent Regression Bayesian Network and End-To-End Mode

Improving Limited Resource Speech Recognition Performance with Latent Regression Bayesian Network

Tibetan Language Continuous Speech Recognition Based on Dynamic Bayesian Network

Audio-Visual Tibetan Speech Recognition Based On A Deep Dynamic Bayesian Network For Natural Human Robot Interaction Regular Paper

End-to-End-Based Tibetan Multitask Speech Recognition.

Tibetan-Mandarin Bilingual Speech Recognition Based on End-to-end Framework

Tibetan Language Continuous Speech Recognition Based On Active Ws-Dbn

Tibetan Multi-dialect Speech and Dialect Identity Recognition

Improving Minority Language Speech Recognition Based on Distinctive Features

Lhasa Dialect Recognition of Different Phonemes Based on TDNN Method.

Speech Recognition Based on Deep Neural Networks on Tibetan Corpus

Multi-task Recurrent Model for True Multilingual Speech Recognition

Cross-Language Transfer Learning-based Lhasa-Tibetan Speech Recognition

Multilingual Articulatory Features Augmentation Learning

Bayesian Neural Network Language Modeling for Speech Recognition

Speaker recognition of Yunnan minority accent Based on bayesian network

Effective Training End-to-End ASR systems for Low-resource Lhasa Dialect of Tibetan Language

Bayesian Learning of LF-MMI Trained Time Delay Neural Networks for Speech Recognition

Deep Neural Network based Uyghur Large Vocabulary Continuous Speech Recognition

Mongolian Speech Recognition Based on Deep Neural Networks