Abstract: This paper presents a comparative performance analysis of shallow and deep machine learning classifiers for speech recognition, using frame-level phoneme classification as the task. Phoneme recognition remains a fundamental and crucial first step in automatic speech recognition (ASR) systems. In contrast to deep learning approaches, conventional classifiers often perform exceptionally well on domain-specific ASR systems with a limited vocabulary and limited training data. It is therefore important to evaluate how accurately systems based on deep artificial networks recognize atomic speech units, in this case phonemes, in comparison with conventional state-of-the-art machine learning classifiers. Two deep learning models, DNN and LSTM, are studied on the OLLO speech corpus in multiple architectural configurations that vary the number of layers and the number of neurons per layer, alongside six shallow machine learning classifiers, all using filterbank acoustic features. Additionally, features with temporal contexts of three and ten frames are computed and compared with no-context features across the different models. Classifier performance is evaluated in terms of precision, recall, and F1 score on 14 consonant and 10 vowel classes for 10 speakers with 4 different dialects. A high classification accuracy of 93% and an F1 score of 95% are obtained with the DNN and LSTM networks, respectively, on context-dependent features with 3 hidden layers of 1024 nodes each. Surprisingly, the SVM obtained an even higher classification score of 96.13%, with a misclassification error of less than 5% for consonants and 4% for vowels.
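The temporal-context features mentioned above can be illustrated with a minimal sketch. It assumes that a context of N frames means N frames on each side of the current frame and that utterance edges are padded by repeating the first or last frame; the abstract states neither, so both choices, along with the function name `splice_frames` and the toy input, are hypothetical.

```python
from typing import List

Frame = List[float]


def splice_frames(frames: List[Frame], context: int) -> List[Frame]:
    """Concatenate each frame with `context` neighbours on either side.

    Assumption: the window is symmetric and edges are padded by repeating
    the first/last frame. With context=0 the features are returned unchanged.
    """
    num_frames = len(frames)
    spliced = []
    for t in range(num_frames):
        window: Frame = []
        for offset in range(-context, context + 1):
            # Clamp the index so windows at the utterance edges stay full-length.
            idx = min(max(t + offset, 0), num_frames - 1)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


# Toy input: 3 frames of 2-dimensional features (real filterbank
# features would have many more coefficients per frame).
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
no_context = splice_frames(toy, 0)   # each frame keeps 2 values
with_context = splice_frames(toy, 1) # each frame now has 2 * (2*1 + 1) = 6 values
```

Under these assumptions, a D-dimensional filterbank frame becomes a D × (2N + 1)-dimensional input vector, so a ten-frame context yields a much wider classifier input than a three-frame one.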
