Abstract: Human infants can discover words directly from unsegmented speech signals without any explicitly labeled data. Current machine learning methods cannot efficiently estimate a language model (LM) and an acoustic model (AM) and discover words directly from continuous human speech signals in an unsupervised manner. To solve this problem, we propose an integrative generative model that combines an LM and an AM into a single generative model called the hierarchical Dirichlet process hidden language model (HDP-HLM). The HDP-HLM is obtained by extending the hierarchical Dirichlet process hidden semi-Markov model (HDP-HSMM) proposed by Johnson et al. An inference procedure for the HDP-HLM is derived using the blocked Gibbs sampler originally proposed for the HDP-HSMM. This procedure enables the simultaneous and direct inference of the LM and the AM from continuous speech signals. Based on the HDP-HLM and its inference procedure, we develop a novel machine learning method called the nonparametric Bayesian double articulation analyzer (NPB-DAA), which can directly acquire an LM and an AM from observed continuous speech signals. By assuming the HDP-HLM as a generative model of the observed time series data and inferring the latent variables of the model, the method can analyze the latent double articulation structure of the data, i.e., hierarchically organized latent words and phonemes, in an unsupervised manner. We also carried out two evaluation experiments, one using synthetic data and one using actual human continuous speech signals representing Japanese vowel sequences. In the word acquisition and phoneme categorization tasks, the NPB-DAA outperformed a conventional double articulation analyzer and a baseline automatic speech recognition system whose AM was trained in a supervised manner.
The main contributions of this paper are as follows: 1) we develop a probabilistic generative model, the HDP-HLM, that integrates an LM and an AM; 2) we derive an inference method for this model and propose the NPB-DAA; and 3) we show that the NPB-DAA can discover words directly from continuous human speech signals in an unsupervised manner.
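
The double articulation structure described above can be illustrated with a toy forward sampler. This is only a minimal sketch under loud assumptions, not the HDP-HLM itself: the lexicon, the bigram table, the one-dimensional Gaussian emissions, the uniform duration range, and every name below are hypothetical placeholders chosen for illustration. It shows the two-level idea only: a word-level model picks words, each word expands into phonemes, and each phoneme emits a variable-length run of frames.

```python
import random

# Hypothetical toy parameters (illustrative only, not taken from the paper).
LEXICON = {"aoi": ["a", "o", "i"], "ie": ["i", "e"], "ue": ["u", "e"]}
BIGRAM = {  # word-level transition probabilities (a stand-in for the LM)
    "aoi": {"aoi": 0.1, "ie": 0.6, "ue": 0.3},
    "ie": {"aoi": 0.5, "ie": 0.1, "ue": 0.4},
    "ue": {"aoi": 0.5, "ie": 0.4, "ue": 0.1},
}
PHONEME_MEAN = {"a": 0.0, "i": 1.0, "u": 2.0, "e": 3.0, "o": 4.0}  # stand-in for the AM


def sample_utterance(n_words, rng, min_dur=2, max_dur=5, noise=0.1):
    """Sample frames plus their latent phoneme and word labels."""
    word = rng.choice(sorted(LEXICON))  # uniform choice of the first word
    frames, phonemes, words = [], [], []
    for _ in range(n_words):
        for ph in LEXICON[word]:
            # Assumed duration model: uniform integer number of frames.
            dur = rng.randint(min_dur, max_dur)
            for _frame in range(dur):
                # Assumed emission model: 1-D Gaussian around the phoneme mean.
                frames.append(rng.gauss(PHONEME_MEAN[ph], noise))
                phonemes.append(ph)
                words.append(word)
        # Draw the next word from the bigram row of the current word.
        nxt = BIGRAM[word]
        word = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return frames, phonemes, words


if __name__ == "__main__":
    frames, phonemes, words = sample_utterance(4, random.Random(0))
    print(len(frames), "frames generated")
```

Unsupervised word and phoneme discovery corresponds to running this process in reverse: given only `frames`, recover the latent `phonemes` and `words` together with the lexicon and both models. The sketch does not attempt that inference step.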
