Abstract:<p class="a-plus-plus">Sensory organs are becoming an essential means of controlling modern machines that require human intervention. Among them, the voice can be used to control and monitor modern interfaces. Automatic Speech Recognition (ASR) is used for many such tasks, for example transcribing natural speech into text and performing actions based on spoken commands. This paper proposes a system for recognizing spoken Arabic numerals and words based on two classification methods. The first combines a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) network and a Fully Connected (FC) network (CNN-LSTM-FC); the second is based on the conventional Dense Network (DenseNet). Both are integrated into the proposed Arabic speech recognition system and classify uniform-length sequences of Mel-frequency Cepstral Coefficients (MFCCs) extracted from the speech utterances. The CNN-LSTM-FC approach is intended to learn high-level features that capture both long-term contextual dependencies and local information. Because these features contain less information than the raw data, they help reduce training time. The CNN-LSTM-FC method also captures global contextual information and local correlations in the MFCCs. 
The DenseNet model is used to benefit from the <span class="a-plus-plus inline-equation id-i-eq1"><span class="a-plus-plus equation-source format-t-e-x">$\frac{L(L+1)}{2}$</span></span> direct connections between its <em class="a-plus-plus">L</em> layers, its ability to alleviate the vanishing-gradient problem, and its reduced number of parameters, which in turn shortens training time. The models were evaluated on two databases: the first contains English voice commands, and the second contains spoken Arabic numerals and words. Experiments showed that the CNN-LSTM-FC model with MFCC features outperformed the DenseNet model on the database of spoken Arabic numerals and words (accuracy = 88.04%, precision = 88.56%, recall = 87.78%, <em class="a-plus-plus">F</em>1 = 88.17, and error = 1.10%). On the database of English voice commands, the CNN-LSTM-FC model achieved the best precision (87.15%), <em class="a-plus-plus">F</em>1 (85.66), and error (0.58%), while the DenseNet model achieved the best accuracy (85.40%) and recall (85.40%). Both proposed models produced acceptable results on both databases while requiring less computation.</p>
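<p class="a-plus-plus">The abstract states that the classifiers operate on uniform-length MFCC sequences but does not say how that length is enforced. The sketch below shows one common option, assumed here rather than taken from the paper: truncate long utterances and zero-pad short ones. The function name <code>to_uniform_length</code>, the parameter <code>target_len</code>, and the zero padding value are all hypothetical.</p>

```python
from typing import List, Sequence


def to_uniform_length(
    frames: Sequence[Sequence[float]],
    target_len: int,
    pad_value: float = 0.0,
) -> List[List[float]]:
    """Truncate or pad a (num_frames x num_coeffs) MFCC sequence.

    Assumption: padding with a constant and dropping trailing frames is
    acceptable; the paper may use a different strategy.
    """
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    if not frames:
        raise ValueError("frames must contain at least one frame")

    num_coeffs = len(frames[0])
    # Keep at most target_len frames; truncation drops the tail.
    fixed = [list(frame) for frame in frames[:target_len]]
    # Append constant frames until the sequence reaches target_len.
    while len(fixed) < target_len:
        fixed.append([pad_value] * num_coeffs)
    return fixed


if __name__ == "__main__":
    utterance = [[1.0, 2.0], [3.0, 4.0]]  # two frames, two coefficients each
    print(to_uniform_length(utterance, target_len=4))
```

<p class="a-plus-plus">Every utterance then has the same number of frames, so a batch can be stacked into a single fixed-shape array before it is passed to either classifier.</p>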

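<p class="a-plus-plus">The L(L+1)/2 connection count quoted for DenseNet follows from each layer receiving the outputs of all preceding layers, including the block input. The sketch below enumerates those connections and checks the count against the closed form. It illustrates the counting argument only, not the paper's network; the function names are hypothetical.</p>

```python
from typing import List, Tuple


def dense_connections(num_layers: int) -> List[Tuple[int, int]]:
    """Return (source, target) pairs for a densely connected block.

    Index 0 is the block input and indices 1..num_layers are the layers.
    Layer k receives input from every index below k.
    """
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    return [
        (src, dst)
        for dst in range(1, num_layers + 1)
        for src in range(dst)
    ]


def closed_form_count(num_layers: int) -> int:
    """Closed-form number of direct connections: L(L+1)/2."""
    return num_layers * (num_layers + 1) // 2


if __name__ == "__main__":
    for depth in (1, 2, 4, 8):
        pairs = dense_connections(depth)
        print(depth, len(pairs), closed_form_count(depth))
```

<p class="a-plus-plus">Layer k contributes k incoming connections, so the total is 1 + 2 + ... + L, which equals L(L+1)/2. A plain feed-forward chain of the same depth has only L.</p>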
Amazigh CNN speech recognition system based on Mel spectrogram feature extraction method

Enhancing amazigh ASR through convolutional neural networks and MFCC

Maghrebian dialect recognition based on support vector machines and neural network classifiers

Spoken Utterance Classification Task of Arabic Numerals and Selected Isolated Words

Amazigh audiovisual speech recognition system design

Multimodal Emotional Classification Based on Meaningful Learning

Convolutional Neural Network for Arabic Speech Recognition

A Convolutional Neural Network Based Approach to Recognize Bangla Spoken Digits from Speech Signal

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Casablanca: Data and Models for Multidialectal Arabic Speech Recognition

Automatic Speaker Recognition Using Mel-Frequency Cepstral Coefficients Through Machine Learning

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Dialectal Arabic Speech Recognition using CNN-LSTM Based on End-to-End Deep Learning

BanglaNum -- A Public Dataset for Bengali Digit Recognition from Speech

Employing Hybrid Deep Neural Networks on Dari Speech

Deep neural network architectures for dysarthric speech analysis and recognition

A novel hybrid feature method based on Caelen auditory model and gammatone filterbank for robust speaker recognition under noisy environment and speech coding distortion

Arabic Language Learning Assisted by Computer, based on Automatic Speech Recognition

VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System

Speech Emotion Recognition Using Mel Frequency Log Spectrogram and Deep Convolutional Neural Network