Abstract: Even though Automatic Speech Recognition (ASR) systems have improved significantly over the last decade, they still introduce many errors when transcribing voice to text. One of the most common causes of these errors is phonetic confusion between similar-sounding expressions. As a result, ASR transcriptions often contain "quasi-oronyms", i.e., words or phrases that sound similar to the source ones but have completely different semantics (e.g., "win" instead of "when" or "accessible on defecting" instead of "accessible and affecting"). These errors significantly degrade the performance of downstream Natural Language Understanding (NLU) models (e.g., intent classification and slot filling) and impair the user experience. To make NLU models more robust to such errors, we propose novel phonetic-aware text representations. Specifically, we represent ASR transcriptions at the phoneme level, aiming to capture pronunciation similarities, which are typically neglected in word-level representations (e.g., word embeddings). To train and evaluate our phoneme representations, we generate noisy ASR transcriptions of four existing datasets (Stanford Sentiment Treebank, SQuAD, TREC Question Classification, and Subjectivity Analysis) and show that common neural network architectures exploiting the proposed phoneme representations can effectively handle noisy transcriptions and significantly outperform state-of-the-art baselines. Finally, we confirm these results by testing our models on real utterances spoken to the Alexa virtual assistant.
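The intuition behind phoneme-level representations can be illustrated with a minimal sketch. It is not the method of the paper: the tiny ARPAbet-style lexicon `TOY_LEXICON` and the helpers `to_phonemes` and `edit_distance` are hypothetical names introduced here for illustration only. The sketch shows that "win" and "when", which are unrelated as word tokens, differ by a single symbol once mapped to phonemes.

```python
# Toy ARPAbet-style lexicon, hand-written for this sketch (an assumption,
# not a resource used in the paper).
TOY_LEXICON = {
    "win": ["W", "IH", "N"],
    "when": ["W", "EH", "N"],
    "cat": ["K", "AE", "T"],
}


def to_phonemes(text):
    """Map a whitespace-tokenized transcription to a flat phoneme sequence."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON[word])  # raises KeyError for unknown words
    return phonemes


def edit_distance(a, b):
    """Levenshtein distance between two sequences of symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# At the word level, "win" and "when" are simply two different tokens.
# At the phoneme level, they differ in one vowel only.
similar = edit_distance(to_phonemes("win"), to_phonemes("when"))
dissimilar = edit_distance(to_phonemes("win"), to_phonemes("cat"))
print(similar, dissimilar)
```

A model that consumes phoneme sequences instead of (or alongside) word tokens can in principle exploit this closeness, so that an ASR confusion between two quasi-oronyms perturbs its input less than it would at the word level.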

Using Phoneme Representations to Build Predictive Models Robust to ASR Errors
