Abstract:

Background: Automatic speech recognition is now built into many applications that people use in daily life, so an accurate recognition system can make everyday routines considerably more convenient. Many such systems exist for widely spoken languages like English, but progress has been insufficient for less common languages such as Turkish. Moreover, because Turkish is agglutinative, designing a speech recognition system for it is more challenging than for other language groups. Our study therefore proposes deep learning models for Turkish automatic speech recognition, complemented by an integrated language model.

Methods: The deep learning models were built from convolutional neural network, gated recurrent unit, long short-term memory, and transformer layers. The language model was built with the Zemberek library to improve system performance, and Bayesian optimization was used to tune the hyper-parameters of the deep learning models. The models were evaluated with two standard metrics for automatic speech recognition: word error rate and character error rate.

Results: With optimal hyper-parameters applied to the models built from the various layers, and without a language model, the Turkish Microphone Speech Corpus dataset yields a word error rate of 22.2 and a character error rate of 14.05, while the Turkish Speech Corpus dataset yields a word error rate of 11.5 and a character error rate of 4.15. Incorporating the language model brings notable improvements. For the Turkish Microphone Speech Corpus dataset, the word error rate drops to 9.85 and the character error rate to 5.35. For the Turkish Speech Corpus dataset, the word error rate drops to 8.4 and the character error rate to 2.7. These results indicate that our model outperforms the studies found in the existing literature.
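The two metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes the common definitions, in which each score is the Levenshtein edit distance between the reference and the hypothesis divided by the reference length, here scaled to a 0-100 range. It also assumes that spaces count as characters for the character error rate, a convention that varies between toolkits. The function names and the example sentence are hypothetical and do not come from either corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate on a 0-100 scale (assumed definition)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate on a 0-100 scale; spaces count as characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: one missing suffix letter in a two-word utterance.
reference = "evlerimizden geldik"
hypothesis = "evlerimizde geldik"
print(wer(reference, hypothesis))  # 50.0: one of two words is wrong
print(cer(reference, hypothesis))  # about 5.26: one of 19 characters is wrong
```

The example also shows why the two scores can diverge sharply for an agglutinative language: a single wrong suffix character marks the whole long word as an error, so the word error rate is much higher than the character error rate.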

Performance Comparison of Pre-trained Models for Speech-to-Text in Turkish: Whisper-Small and Wav2Vec2-XLS-R-300M

Türkçe Dil Modellerinin Performans Karşılaştırması (Performance Comparison of Turkish Language Models)

Efficient Compression of Multitask Multilingual Speech Models

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

Implementation of a Whisper Architecture-Based Turkish Automatic Speech Recognition (ASR) System and Evaluation of the Effect of Fine-Tuning with a Low-Rank Adaptation (LoRA) Adapter on Its Performance

Comparison of Pre-trained Language Models for Turkish Address Parsing

Customized deep learning based Turkish automatic speech recognition system supported by language model

Developing and Evaluating Tiny to Medium-Sized Turkish BERT Models

Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts

Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks

Building a Speech and Text Corpus of Turkish: Large Corpus Collection with Initial Speech Recognition Results

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models

End-to-End Transformer-based Automatic Speech Recognition for Northern Kurdish: A Pioneering Approach

Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition

On the Transferability of Whisper-based Representations for "In-the-Wild" Cross-Task Downstream Speech Applications

Optimizing Large Language Models for Turkish: New Methodologies in Corpus Selection and Training

The Comparison of Language Models with a Novel Text Filtering Approach for Turkish Sentiment Analysis

A Comparative Analysis of Pretrained Language Models for Text-to-Speech

N-Shot Benchmarking of Whisper on Diverse Arabic Speech Recognition