Abstract: Research in Arabic automatic speech recognition (ASR) is constrained by datasets of limited size and of highly variable content and quality. Arabic-language resources vary in the attributes that affect language resources in other languages (noise, channel, speaker, genre), but they also vary significantly in the dialect and level of formality of the spoken Arabic they capture. Many languages exhibit similar levels of cross-dialect and cross-register acoustic variability, but these effects have been under-studied. This paper presents an experimental analysis of the interaction between classical ASR corpus-compensation methods (feature selection, data selection, gender-dependent acoustic models) and the dialect-dependent and register-dependent variation among Arabic ASR corpora. The first interaction studied is that between acoustic recording quality and discrete pronunciation variation. Discrete pronunciation variation can be compensated for by using grapheme-based instead of phone-based acoustic models and by filtering out speakers with insufficient training data; the latter technique also helps to compensate for poor recording quality, which is further compensated for by eliminating delta-delta acoustic features. Together, the three techniques reduce word error rate (WER) by between 3.24% and 5.35%. The second aspect of dialect and register variation considered is variation in the fine-grained acoustic pronunciation of each phoneme in the language. Experimental results show that gender and dialect are the principal components of variation in speech; building gender-specific and dialect-specific models therefore leads to substantial decreases in WER.
To further explore the degree of acoustic difference between the phone models required for each Arabic dialect, cross-dialect experiments measure how far apart the dialects are acoustically, which informs the decision about the minimal number of recognition systems needed to cover all dialectal Arabic. Finally, the research addresses an important question: how much training data is needed to build efficient speaker-independent ASR systems? To answer it, learning curves are developed to determine how large the training set must be to achieve acceptable performance.

Investigating the effects of gender, dialect, and training size on the performance of Arabic speech recognition
