SIM-BERT: Speech intelligence model using NLP-BERT with improved accuracy

Pankaj Kumar, Sanjib Kumar Sahu
DOI: https://doi.org/10.1201/9781003150664-48
2021-06-24
Abstract: Speech recognition and language understanding are among the most critical tasks when it comes to understanding language models (LMs). At present, various end-to-end learning models are used for speech recognition, built on unidirectional and bidirectional language models. Despite their theoretical advantages over conventional unidirectional and bidirectional approaches, they have not been found to improve accuracy. BERT (Bidirectional Encoder Representations from Transformers), a recently proposed pre-trained language representation model from Google’s AI team, consists of a multi-layer bidirectional Transformer encoder and provides much better accuracy than using only a unidirectional or bidirectional approach with a huge corpus of training data. Natural language processing (NLP), in turn, is used for language understanding (LU) and language generation (LG). In this study, we have therefore designed a model that extracts text from speech based on classification ranking and then uses BERT to analyze the context and semantics of the top-ranked sentences. BERT takes a bidirectional approach, reading the words of a sentence from both the left and the right, and provides a relevance score based on the meaning of the entire sentence and of the words around each word. We observe that using a pre-trained model decreases processing time and improves the accuracy and turnaround time of an end-to-end speech recognition system. This chapter discusses how the SIM-BERT model is useful in analyzing the audio signal, extracting the embedded text, analyzing the relevant information using a language model, and then constructing an audio signal as output to the user. The SIM-BERT model is fine-tuned to minimize the loss for predicting the correct starting index and ending index of the output audio words. A speech intelligence system is a pre-trained model that identifies an audio signal with better accuracy against noise, context, and semantic representations, then uses its pre-trained NLP model to understand and process the text, and finally converts the output into a relevant audio signal with high accuracy and precision.
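The abstract states that SIM-BERT is fine-tuned to minimize the loss for predicting the starting and ending index of the output words, but it does not give the loss itself. The sketch below is one plausible reading, assuming the span objective commonly used when fine-tuning BERT for extractive tasks: the average of two cross-entropy terms, one over start positions and one over end positions. The function names `span_loss` and `best_span`, the `max_len` parameter, and the example logits are all hypothetical and are not taken from the chapter.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of per-token scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def span_loss(start_logits, end_logits, start_index, end_index):
    """Average cross-entropy of the true start and end positions.

    Assumption: this mirrors the usual BERT span-prediction objective;
    the chapter may define its loss differently.
    """
    start_nll = -log_softmax(start_logits)[start_index]
    end_nll = -log_softmax(end_logits)[end_index]
    return (start_nll + end_nll) / 2.0


def best_span(start_logits, end_logits, max_len=10):
    """Pick the (start, end) pair with the highest summed score, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        # Only consider spans of at most max_len tokens that end at or after s.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Hypothetical scores for a four-token transcript.
start_scores = [0.0, 5.0, 0.0, 0.0]
end_scores = [0.0, 0.0, 4.0, 0.0]
predicted = best_span(start_scores, end_scores)
loss = span_loss(start_scores, end_scores, start_index=1, end_index=2)
```

In a full system the logits would come from a linear layer on top of the BERT encoder's token outputs, and the predicted span would then be passed to a speech synthesis stage; both of those steps are outside this sketch.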