Abstract: Speech recognition technology has become increasingly common in recent years. Speech quality and intelligibility are critical to convenient and accurate information transmission in speech recognition. Speech processing systems used to communicate or store speech are usually designed for environments without background noise. In real-world environments, however, interference in the form of background noise and channel noise drastically reduces the performance of speech recognition systems, resulting in imprecise information transfer and listener fatigue. Speech enhancement techniques aim to improve the performance of communication systems whose input or output signals are affected by noise. To ensure that the text produced from speech is correct, the external noise in the speech audio must be reduced. This is difficult because the speech may consist of isolated words, continuous speech, or spontaneous speech. Several typical speech enhancement algorithms have gained considerable attention in automatic speech recognition. However, these algorithms work well only on simple and continuous audio signals. This study therefore proposes a hybridized speech recognition algorithm to improve speech recognition accuracy. Non-linear spectral subtraction, a well-known speech enhancement algorithm, is optimized with the Hidden Markov Model and tested on 6660 medical speech transcription audio files and 1440 Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) audio files. The performance of the proposed model is compared with that of typical speech enhancement algorithms, such as the iterative signal enhancement algorithm, subspace-based speech enhancement, and non-linear spectral subtraction.
The proposed cascaded hybrid algorithm achieved minimum word error rates of 9.5% and 7.6% for medical speech and RAVDESS speech, respectively. Cascading the speech enhancement and speech-to-text conversion architectures yields higher speech recognition accuracy. The evaluation results support incorporating the proposed method into real-time automatic speech recognition applications in the medical domain, where the terminology involved is highly complex.
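
For readers unfamiliar with the metric reported above, word error rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the recognized text and a reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate`, the whitespace tokenization, and the sample sentences are assumptions for illustration and are not taken from the study, which may normalize or tokenize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting,
    which is an assumption and not the study's documented procedure.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcript pair: one reference word ("a") is dropped.
wer = word_error_rate("the patient has a mild fever",
                      "the patient has mild fever")
print(f"WER = {wer:.1%}")
```

Under this definition, a word error rate of 9.5% corresponds to roughly one word-level error for every ten to eleven reference words.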

Maximum likelihood based estimation with quasi oppositional chemical reaction optimization algorithm for speech signal enhancement

A Hybrid Speech Enhancement Algorithm for Voice Assistance Application

Multi-objective Approach to Speech Enhancement Using Tunable Q-Factor-based Wavelet Transform and ANN Techniques

A Multiobjective Learning and Ensembling Approach to High-Performance Speech Enhancement with Compact Neural Network Architectures

An approach for speech enhancement with dysarthric speech recognition using optimization based machine learning frameworks

Speech Enhancement Approach Based on Minimum Estimate and Spectral Subtraction

Attention-based Speech Enhancement Using Human Quality Perception Modelling

Speech Intelligibility Based Enhancement System Using Modified Deep Neural Network and Adaptive Multi-band Spectral Subtraction

Dual-Stage Low-Complexity Reconfigurable Speech Enhancement

Real-time Spectrum Estimation–based Dual-Channel Speech-Enhancement Algorithm for Cochlear Implant

MetaRL-SE: a few-shot speech enhancement method based on meta-reinforcement learning

Restorative Speech Enhancement: A Progressive Approach Using SE and Codec Modules

Causal speech enhancement using dynamical-weighted loss and attention encoder-decoder recurrent neural network

Speech Enhancement Algorithm Based on Spectral Subtraction

Towards Intelligibility-Oriented Audio-Visual Speech Enhancement

A Refining Underlying Information Framework for Monaural Speech Enhancement

A Speech Intelligibility Enhancement Model based on Canonical Correlation and Deep Learning for Hearing-Assistive Technologies

Spectral oversubtraction? An approach for speech enhancement after robot ego speech filtering in semi-real-time

Optimizing Audio-Visual Speech Enhancement Using Multi-Level Distortion Measures for Audio-Visual Speech Recognition

A Hybrid Approach for Speech Enhancement Using MoG Model and Neural Network Phoneme Classifier