Augmenting Polish Automatic Speech Recognition System With Synthetic Data

Łukasz Bondaruk, Jakub Kubiak, Mateusz Czyżnikiewicz
2024-10-30
Abstract: This paper presents a system developed for submission to PolEval 2024, Task 3: Polish Automatic Speech Recognition Challenge. We describe a Voicebox-based speech synthesis pipeline and use it to augment Conformer and Whisper speech recognition models with synthetic data. We show that adding synthetic speech to training significantly improves the results. We also present the final results achieved by our models in the competition.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
This paper addresses the limited performance of Polish Automatic Speech Recognition (ASR) systems caused by the scarcity of high-quality labeled data. Specifically, the authors augment the training set with synthetic speech to compensate for scarce natural speech resources and improve model performance.

### Problem Background

1. **Data Scarcity**: For less widely used languages such as Polish, high-quality labeled speech data is very limited, which makes it difficult to train effective ASR systems.
2. **Application of Synthetic Data**: Speech synthesis technology has progressed significantly in recent years and can now generate high-quality synthetic speech. This synthetic data can be used to augment real-world data sets and thereby improve ASR performance.

### Solution

The authors propose a Voicebox-based speech synthesis pipeline and use its output to augment the training of two ASR models, Conformer and Whisper. The method consists of:

- **Speech Synthesis**: Use Voicebox to generate synthetic speech data. This system can generate high-quality and diverse synthetic speech.
- **Data Augmentation**: Mix synthetic data with real data to form a new training set, increasing both the diversity and the quantity of data.
- **Model Training**: Train the Conformer and Whisper models on the augmented data set and evaluate the resulting change in performance.

### Experimental Results

The experimental results show that adding synthetic data significantly improves model performance, in particular as measured by Word Error Rate (WER) and Character Error Rate (CER).

### Main Contributions

- **Innovative Data Augmentation Method**: The training data of the ASR system is augmented with synthetic data, mitigating the problem of data scarcity in Polish ASR.
- **Performance Improvement**: Experiments show that using synthetic data can significantly improve the performance of ASR models, especially in low-resource language settings.

### Conclusion

This research demonstrates the effectiveness of augmenting ASR training with synthetic data, providing a feasible method for addressing data scarcity in low-resource languages. Future research can explore how to generate more diverse and higher-quality synthetic data to further improve ASR performance.
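The WER and CER metrics mentioned in the experimental results can be illustrated with a short sketch. It uses the standard definitions (edit distance between reference and hypothesis, divided by the reference length, over words for WER and over characters for CER). The function names and the whitespace tokenization are assumptions made for illustration; the summary does not say how the authors normalized text or which tool they used for scoring.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate; assumes plain whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate; spaces are counted as characters here."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("ala ma kota", "ala ma psa")` yields one substitution over three reference words, about 0.33. Real evaluations usually normalize casing and punctuation before scoring, which this sketch leaves out.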