Abstract: We present an audio-visual automatic speech recognition system that significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system consists of three components: (i) a visual module, (ii) an acoustic module, and (iii) a Dynamic Bayesian Network-based recognition module. The visual module locates and tracks the speaker's head and mouth movements and extracts relevant speech features, represented by contour information and 3D deformations of the lips. The acoustic module extracts noise-robust features, i.e. the Mel Filterbank Cepstrum Coefficients (MFCCs). Finally, we propose two models based on Dynamic Bayesian Networks (DBNs) that either model the audio and video streams individually or integrate the features from both streams. We also compare the proposed DBN-based system with a classical Hidden Markov Model. The novelty of the developed framework is that the characteristics of the audio-visual speech signal are preserved from the extraction step through the learning step. Experiments on continuous audio-visual speech show that the segmentation boundaries of phones in the audio stream and visemes in the video stream are close to manual segmentation boundaries.

Audio-Visual Tibetan Speech Recognition Based on a Deep Dynamic Bayesian Network for Natural Human Robot Interaction (International Journal of Advanced Robotic Systems, Regular Paper)

Tibetan Language Continuous Speech Recognition Based on Dynamic Bayesian Network

Unsupervised Tibetan speech features Learning based on Dynamic Bayesian Networks

DBN based models for audio-visual speech analysis and recognition

Mongolian acoustic modeling based on deep neural network

Speech Recognition Based on Deep Neural Networks on Tibetan Corpus

Dynamic bayesian networks for audio-visual speaker recognition

Tibetan Language Continuous Speech Recognition Based on Active WS-DBN

Tibetan Multi-Dialect Speech Recognition Using Latent Regression Bayesian Network and End-To-End Mode

Automatic Speaker Recognition Using Dynamic Bayesian Network.

Deep Audio-visual System for Closed-set Word-level Speech Recognition

Mongolian Speech Recognition Based on Deep Neural Networks

DBN Based Multi-Stream Models for Speech

Deep Neural Network based Uyghur Large Vocabulary Continuous Speech Recognition

Multi-task Joint-Learning of Deep Neural Networks for Robust Speech Recognition

Tibetan Multi-dialect Speech and Dialect Identity Recognition

A Robust Visual Feature Extraction Based BTSM-LDA for Audio-Visual Speech Recognition

End-to-End-Based Tibetan Multitask Speech Recognition.

Research on the Algorithm of Tibetan Speech Recognition Based on DBN

Mongolian Text-to-Speech System Based on Deep Neural Network