Enhancing Amazigh ASR through convolutional neural networks and MFCC

Hossam Boulal, Mohamed Hamidi, Jamal Barkani, Mustapha Abarkan
DOI: https://doi.org/10.1007/s11042-024-20451-0
IF: 2.577
2024-11-16
Multimedia Tools and Applications
Abstract: In this study, we developed a speech recognition system for the Amazigh language, specifically targeting recognition of the first ten numbers. The system employs four Convolutional Neural Network (CNN) models: three custom-designed models and a pre-trained VGG19 model. Our experiments used a dataset of 4200 audio files recorded by 42 distinct speakers, with input features extracted as Mel Frequency Cepstral Coefficients (MFCCs). We tested three normalization settings: no normalization, Cepstral Mean and Variance Normalization (CMVN), and Min-Max normalization. While CMVN generally provided effective standardization, we achieved the highest accuracy, 97.56%, using Min-Max normalization with a specific filter size in the third custom CNN model. The VGG19 model, however, showed suboptimal performance. These findings underscore the importance of selecting suitable normalization techniques and model architectures for improving speech recognition accuracy.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
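
The two normalization schemes compared in the abstract can be illustrated with a minimal sketch. It assumes an MFCC matrix stored as a list of frames, each a list of coefficients. The function names, the `eps` guard, the per-coefficient scope for CMVN, and the per-utterance global scope for Min-Max are all illustrative assumptions; the abstract does not specify how the authors applied either method.

```python
from statistics import fmean, pstdev


def cmvn(features, eps=1e-8):
    """Cepstral mean and variance normalization (assumed per coefficient).

    features: list of frames, each a list of MFCC coefficients.
    Each coefficient track is shifted to zero mean and scaled to unit variance.
    """
    columns = list(zip(*features))            # one tuple per coefficient
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]   # population standard deviation
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in features
    ]


def min_max(features, eps=1e-8):
    """Min-Max normalization (assumed global over the whole utterance).

    Rescales every value into the range [0, 1].
    """
    flat = [x for frame in features for x in frame]
    lo, hi = min(flat), max(flat)
    return [[(x - lo) / (hi - lo + eps) for x in frame] for frame in features]


if __name__ == "__main__":
    # Toy 3-frame, 2-coefficient "MFCC" matrix (made-up values).
    mfcc = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
    print(cmvn(mfcc))
    print(min_max(mfcc))
```

CMVN removes per-coefficient offset and scale, while Min-Max preserves the relative shape of the values and only bounds their range; which of the two suits a given CNN is an empirical question, as the reported results suggest.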