Abstract: In this paper, we analyze the impact of five Arabic dialects on the front-end and pronunciation dictionary components of an Automatic Speech Recognition (ASR) system. We use the ASR system's phonetic decision tree as a diagnostic tool to compare the robustness of MFCC and MLP front-ends to dialectal variations in the speech data, and find that MLP Bottle-Neck features are less robust to such variations. We also perform a rule-based analysis of the pronunciation dictionary, which enables us to identify dialectal words in the vocabulary and automatically generate pronunciations for unseen words. We show that our technique produces pronunciations with an average phone error rate of 9.2%.

The Arabic language is characterized by its multitude of dialects. Although Modern Standard Arabic (MSA) is used in writing, TV/radio broadcasts and formal communication, all informal communication is typically carried out in one of the regional dialects of Arabic. Dialectal variations influence the pronunciation dictionary, acoustic models and language models of an ASR system. Previous work on dialectal Arabic ASR includes cross-dialectal data sharing (1) and improved pronunciation and language modeling (2, 3), among others. In this paper, we describe our experiments on a dialectal Arabic speech database, where we focus on analyzing how different front-ends and the pronunciation dictionary behave under dialectal variations between speakers.

We evaluate Mel-Frequency Cepstral Coefficient (MFCC) and Multi-Layer Perceptron (MLP) front-ends on their ability to handle the variations that arise from different dialects. Extending our previous work on gender normalization (4), we use phonetic decision trees as a diagnostic tool to analyze the influence of dialect on the clustered models. We introduce questions pertaining to dialect, in addition to questions about context, when building the decision tree. We then build the tree to cluster the contexts and count the leaves that belong to branches with dialectal questions. The ratio of such 'dialectal' models to the total number of models is used as a measure of dialect normalization: the higher the ratio, the more models are affected by dialect and hence the less normalization, and vice versa.

We further extend our analysis to the pronunciation dictionary, where we investigate ways to generate rule-based pronunciations for unseen words in a dialect with minimal manual effort. Our setup features a 'Pan-Arabic' dictionary, which contains pronunciations typically found in five Arabic dialects. We analyze the pronunciation variants in this common dictionary using acoustic model alignments to derive the dialect-specific pronunciations for each word. These pronunciations form the input to our rule-learning algorithm, which maps word pronunciations from one dialect to another. The learned rules are then used to generate pronunciations for unseen words, and their accuracy is estimated.
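The leaf-counting measure described above can be illustrated with a minimal sketch. The tree representation here (a `Node` class with an `is_dialect_question` flag) and the toy tree are assumptions made purely for illustration; they are not the data structures of the ASR toolkit used in the experiments. A leaf is counted as 'dialectal' if any question on the path from the root to that leaf is a dialect question.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """One node of a hypothetical binary phonetic decision tree."""
    question: Optional[str] = None      # None marks a leaf (one clustered model)
    is_dialect_question: bool = False   # True if the question asks about dialect
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def count_leaves(node: Node, under_dialect: bool = False) -> Tuple[int, int]:
    """Return (dialectal_leaves, total_leaves) for the subtree rooted at node."""
    if node.question is None:
        return (1 if under_dialect else 0), 1
    # Once a dialect question has been asked, every leaf below it is dialectal.
    flag = under_dialect or node.is_dialect_question
    d_yes, t_yes = count_leaves(node.yes, flag)
    d_no, t_no = count_leaves(node.no, flag)
    return d_yes + d_no, t_yes + t_no


def dialect_ratio(root: Node) -> float:
    """Fraction of leaves that sit below at least one dialect question."""
    dialectal, total = count_leaves(root)
    return dialectal / total


# Toy tree (illustrative only): one context question at the root, with a
# dialect question on its "yes" branch and a plain leaf on its "no" branch.
toy_tree = Node(
    question="left context is a vowel?",
    yes=Node(
        question="speaker dialect is Egyptian?",
        is_dialect_question=True,
        yes=Node(),
        no=Node(),
    ),
    no=Node(),
)

ratio = dialect_ratio(toy_tree)  # 2 of the 3 leaves are dialectal
```

A higher value of `ratio` would indicate that more clustered models depend on dialect, that is, that the front-end normalizes dialect less.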
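As a hedged illustration of how the accuracy of generated pronunciations might be scored, the sketch below computes a phone error rate as the Levenshtein (edit) distance between reference and generated phone sequences, divided by the number of reference phones. The text above does not specify the exact scoring procedure, so this standard definition is an assumption, and the phone strings are invented examples rather than entries from the Pan-Arabic dictionary.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, insertions and deletions
    needed to turn the phone sequence hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phone_error_rate(pairs: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Total edit errors divided by total reference phones over all words."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total


# Illustrative (reference, generated) pronunciations; not real dictionary data.
example_pairs = [
    (["q", "a", "l", "b"], ["g", "a", "l", "b"]),            # one substitution
    (["k", "i", "t", "a", "b"], ["k", "i", "t", "a", "b"]),  # exact match
]

per = phone_error_rate(example_pairs)  # 1 error over 9 reference phones
```

Pooling errors over all words, rather than averaging per-word rates, keeps long words from being under-weighted; either convention could be used as long as it is reported.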

Analysis of Dialectal Influence in Pan-Arabic ASR.
