Abstract: Speech recognition and language understanding are among the most critical tasks addressed with language models (LMs). At present, various end-to-end learning models are used for speech recognition with unidirectional and bidirectional language models. Despite their theoretical advantages over conventional unidirectional and bidirectional approaches, their accuracy has not been found to improve. BERT (Bidirectional Encoder Representations from Transformers), a recently proposed pre-trained language representation model from Google’s AI team that consists of a multi-layer bidirectional Transformer encoder, provides much better accuracy than a unidirectional or bidirectional approach alone when trained on a huge corpus. Natural language processing (NLP), in turn, is used for language understanding (LU) and language generation (LG). In this study, we design a model that extracts text from speech based on classification ranking and then uses BERT to analyze the context and semantics of the top-ranked sentences. BERT reads a sentence from both the left and the right to understand the semantics of its words, and it assigns a relevance score based on the meaning of the entire sentence and the words around each word. We observe that using a pre-trained model decreases processing time and improves the accuracy and turnaround time of the end-to-end speech recognition system. This chapter discusses how the SIM-BERT model analyzes the audio signal, extracts the embedded text, analyzes the relevant information using a language model, and then constructs an audio signal as output to the user. The SIM-BERT model is fine-tuned to minimize the loss for predicting the correct starting index and ending index of the output audio words. A speech intelligence system is a pre-trained model that identifies audio signals with better accuracy against noise, context, and semantic representations, uses its pre-trained NLP model to understand and process the extracted text, and then converts the output into a relevant audio signal with high accuracy and precision.
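The abstract does not state the form of the fine-tuning loss. As a minimal sketch, assuming it follows the standard BERT span-extraction objective (the mean of the cross-entropy losses over the start and end positions), the computation might look like the following. The function names `span_loss` and `best_span` and the example logits are hypothetical and do not come from the chapter.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def span_loss(start_logits, end_logits, start_index, end_index):
    """Mean of the start and end cross-entropy losses.

    Assumption: this mirrors the usual BERT span objective; the
    chapter itself does not spell out the loss.
    """
    start_nll = -log_softmax(start_logits)[start_index]
    end_nll = -log_softmax(end_logits)[end_index]
    return 0.5 * (start_nll + end_nll)


def best_span(start_logits, end_logits):
    """Return the (start, end) pair with start <= end and highest summed logit."""
    best = None
    for i, s in enumerate(start_logits):
        for j in range(i, len(end_logits)):
            score = s + end_logits[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    return best[1], best[2]


# Hypothetical per-token logits for a three-word output.
start_logits = [0.1, 2.0, 0.3]
end_logits = [0.0, 0.5, 3.0]
predicted = best_span(start_logits, end_logits)
loss = span_loss(start_logits, end_logits, start_index=1, end_index=2)
```

Under this assumption, fine-tuning lowers `span_loss` on labelled examples, and `best_span` picks the predicted word range at inference time.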

SIM-BERT: Speech intelligence model using NLP-BERT with improved accuracy
