Performance Comparison of Pre-trained Models for Speech-to-Text in Turkish: Whisper-Small and Wav2Vec2-XLS-R-300M

Oyku Berfin Mercan, Sercan Cepni, Davut Emre Tasar, Sukru Ozan
2023-07-07
Abstract: This study examines the performance of Whisper-Small and Wav2Vec2-XLS-R-300M, two pre-trained multilingual speech-to-text models, on the Turkish language. The Turkish portion of Mozilla Common Voice version 11.0, an open-source dataset, was used. Both models were fine-tuned on this dataset, which contains a small amount of data, and their speech-to-text performance was compared. The word error rate (WER) was 0.28 for Wav2Vec2-XLS-R-300M and 0.16 for Whisper-Small. In addition, the models were evaluated on test data prepared from call center recordings that were not included in the training and validation sets.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper compares the performance of two pre-trained models for Turkish speech-to-text (STT). Specifically, it evaluates how accurately and how usefully two popular multilingual models, Whisper-Small and Wav2Vec2-XLS-R-300M, convert Turkish speech to text. The main objectives are:
1. Fine-tuning the two models on the Mozilla Common Voice 11.0 dataset and comparing their performance.
2. Testing both models in a different scenario, namely on a dataset of call center recordings.
3. Providing guidance on choosing a suitable model for Turkish STT tasks.
By comparing the two models systematically, the paper offers a reference point for researchers and practitioners working on Turkish STT.
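The results above are reported as word error rate (WER), so a small worked example may help in reading the 0.28 and 0.16 figures. The sketch below assumes the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a model hypothesis, divided by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative assumptions, not taken from the paper, and the paper's own tooling and text normalization are not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace with no further normalization
    (no lowercasing or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical Turkish example: one of four reference words is dropped.
    print(word_error_rate("bugün hava çok güzel", "bugün hava güzel"))
```

Under this definition, a WER of 0.16 corresponds to roughly 16 word-level errors per 100 reference words. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.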