Bidirectional RNN for Audio Deep Learning in an End-to-End Model

R. Deekshitha, M. Aishwaryalakshmi, G. Janani Sree, S. Muthulakshmi
DOI: https://doi.org/10.1109/icces54183.2022.9835945
2022-06-22
Abstract: Automatic speech recognition (ASR) lets users of information systems enter data by speaking rather than typing it into a terminal. Advances in ASR enable machines to respond to human voices accurately and reliably, allowing them to deliver useful and meaningful services. People tend to prefer such devices because communicating with a machine by voice is faster than using a keyboard. Since spoken language dominates human communication, it is natural for people to expect voice interfaces for computers. This can be achieved by building speech-to-text software that enables a device to convert voice commands and conversation into text. A typical ASR system comprises three models: the acoustic model, the language model, and the vocabulary model. Variation in voices, the environment (which introduces background noise), and the speaker’s accent are all obstacles in automatic speech recognition. The key idea is to examine the spectrogram and MFCC features of the input audio signals and build state-of-the-art deep learning models. An acoustic model was trained using an RNN-GRU (Recurrent Neural Network with Gated Recurrent Units) on the LibriSpeech dataset, and a language model was then implemented using BERT to improve the performance of the acoustic model. When trained end to end with suitable regularization, RNN-GRU is observed to achieve a higher word error rate. A lower word error rate, compared with a higher one, indicates better speech recognition performance. In automatic speech recognition, deep learning techniques make it easier to reduce the word error rate.
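The word error rate that the abstract uses as its metric is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and the abstract does not state how the authors computed their scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the count of errors is normalized by the reference length, a hypothesis with many inserted words can yield a value above 1.0, and a lower value indicates a closer match to the reference.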
Computer Science