Abstract: Research in Arabic automatic speech recognition (ASR) is constrained by datasets of limited size and of highly variable content and quality. Arabic-language resources vary in the attributes that affect language resources in other languages (noise, channel, speaker, genre), but they also vary significantly in the dialect and level of formality of the spoken Arabic they capture. Many languages suffer similar levels of cross-dialect and cross-register acoustic variability, but these effects have been under-studied. This paper is an experimental analysis of the interaction between classical ASR corpus-compensation methods (feature selection, data selection, gender-dependent acoustic models) and the dialect-dependent and register-dependent variation among Arabic ASR corpora. The first interaction studied is that between acoustic recording quality and discrete pronunciation variation. Discrete pronunciation variation can be compensated by using grapheme-based instead of phone-based acoustic models and by filtering out speakers with insufficient training data. The latter technique also helps to compensate for poor recording quality, which is further compensated by eliminating delta-delta acoustic features. Together, the three techniques reduce Word Error Rate (WER) by between 3.24% and 5.35%. The second aspect of dialect and register variation considered is variation in the fine-grained acoustic pronunciation of each phoneme in the language. Experimental results show that gender and dialect are the principal components of variation in speech; building gender-specific and dialect-specific models therefore leads to substantial decreases in WER.
To further explore how much the phone models required for each Arabic dialect differ acoustically, cross-dialect experiments measure how far apart the dialects are, which informs a decision about the minimal number of recognition systems needed to cover all dialectal Arabic. Finally, the research addresses an important question: how much training data is needed to build efficient speaker-independent ASR systems? To answer it, learning curves are developed to determine how large the training set must be to achieve acceptable performance.
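
Two ideas from the abstract can be illustrated with a minimal Python sketch: filtering out speakers with insufficient training data, and the WER metric used to report results. This is not the authors' implementation. The function names, the `(speaker_id, duration_seconds, transcript)` tuple layout, and the `min_seconds` threshold are all assumptions made for illustration, and WER is computed here in its usual form, as word-level edit distance divided by the number of reference words.

```python
from collections import defaultdict


def filter_speakers(utterances, min_seconds):
    """Keep only utterances from speakers whose total audio meets a threshold.

    `utterances` is a list of (speaker_id, duration_seconds, transcript)
    tuples. The tuple layout and `min_seconds` are hypothetical choices.
    """
    totals = defaultdict(float)
    for speaker, duration, _ in utterances:
        totals[speaker] += duration
    return [u for u in utterances if totals[u[0]] >= min_seconds]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `wer("a b c d", "a x c")` counts one substitution and one deletion against four reference words, giving 0.5. In a real system the threshold for dropping a speaker would need to be tuned on held-out data.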

End-to-End Speech Recognition For Arabic Dialects

Dialectal Arabic Speech Recognition using CNN-LSTM Based on End-to-End Deep Learning

VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System

Data Augmentation for Arabic Speech Recognition Based on End-to-End Deep Learning

Arabic Dysarthric Speech Recognition Using Adversarial and Signal-Based Augmentation

End-to-End Automatic Speech Recognition model for the Sudanese Dialect

Towards Zero-Shot Text-To-Speech for Arabic Dialects

Embedded Learning Segmentation Approach for Arabic Speech Recognition

A Transfer Learning End-to-End Arabic Text-To-Speech (TTS) Deep Architecture

Investigating the effects of gender, dialect, and training size on the performance of Arabic speech recognition

Effective Deep Learning Models for Automatic Diacritization of Arabic Text

Arabic Speech Recognition: Advancement and Challenges

Investigations on Speech Recognition Systems for Low-Resource Dialectal Arabic-English Code-Switching Speech

Beyond Orthography: Automatic Recovery of Short Vowels and Dialectal Sounds in Arabic

Dialectal Coverage And Generalization in Arabic Speech Recognition

Casablanca: Data and Models for Multidialectal Arabic Speech Recognition

Speaker independent recognition of low-resourced multilingual Arabic spoken words through hybrid fusion

Deep Learning, Ensemble and Supervised Machine Learning for Arabic Speech Emotion Recognition

Speech recognition challenge in the wild: Arabic MGB-3

Advancing AI-Driven Linguistic Analysis: Developing and Annotating Comprehensive Arabic Dialect Corpora for Gulf Countries and Saudi Arabia

Recognition of Arabic Accents From English Spoken Speech Using Deep Learning Approach