AutoMode-ASR: Learning to Select ASR Systems for Better Quality and Cost

Ahmet Gündüz, Yunsu Kim, Kamer Ali Yuksel, Mohamed Al-Badrashiny, Thiago Castro Ferreira, Hassan Sawaf
2024-09-19
Abstract: We present AutoMode-ASR, a novel framework that effectively integrates multiple ASR systems to enhance the overall transcription quality while optimizing cost. The idea is to train a decision model to select the optimal ASR system for each segment based solely on the audio input before running the systems. We achieve this by ensembling binary classifiers, each of which determines the preference between two systems. These classifiers are equipped with various features, such as audio embeddings, quality estimation, and signal properties. Additionally, we demonstrate how using a quality estimator can further improve performance with minimal cost increase. Experimental results show a relative reduction in WER of 16.2%, a cost saving of 65%, and a speed improvement of 75%, compared to using a single-best model for all segments. Our framework is compatible with commercial and open-source black-box ASR systems as it does not require changes to the model code.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the problem of choosing the best automatic speech recognition (ASR) system when audio conditions vary. It proposes the AutoMode-ASR framework, which predicts the most suitable ASR system for each audio segment before any system is run, improving overall transcription quality while optimizing cost. The main contributions of the paper are:

1. **Proposing a new combination scheme**: Optimizing the quality and cost of ASR models at the segment level.
2. **Analyzing feature types**: Identifying which features are crucial for accurately predicting ASR system performance.
3. **Proposing a robust classification module**: Facilitating the gradual integration of new ASR systems.
4. **Introducing quality estimation**: Using a quality estimator to further improve performance with minimal cost increase.

Experimental results show that, compared to using a single best model for all segments, AutoMode-ASR reduces the word error rate (WER) by 16.2% relative, saves 65% of the cost, and improves speed by 75%. The framework is also compatible with both commercial and open-source black-box ASR systems, since it requires no modifications to the model code.
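
The selection step, in which an ensemble of binary classifiers each express a preference between two systems, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the soft-voting rule, the `select_system` function, the system names `sys_a`, `sys_b`, `sys_c`, and the stub classifiers keyed on a single signal-to-noise feature are all assumptions made for illustration. In the paper the classifiers are trained models using audio embeddings, quality estimation, and signal properties.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical interface: a pairwise classifier maps segment features to the
# probability that the FIRST system of its pair yields the better transcript.
PairwiseClassifier = Callable[[Sequence[float]], float]


def select_system(
    features: Sequence[float],
    systems: Sequence[str],
    classifiers: Dict[Tuple[str, str], PairwiseClassifier],
) -> str:
    """Pick one ASR system for a segment by soft voting over all system pairs."""
    votes = {name: 0.0 for name in systems}
    for a, b in combinations(systems, 2):
        p_a = classifiers[(a, b)](features)  # P(a is preferred over b)
        votes[a] += p_a
        votes[b] += 1.0 - p_a
    # On a tie, max() keeps the system listed first in `systems`.
    return max(systems, key=lambda name: votes[name])


# Stub classifiers standing in for trained models (assumption): the only
# feature here is a made-up signal-to-noise ratio in dB.
systems = ["sys_a", "sys_b", "sys_c"]
classifiers: Dict[Tuple[str, str], PairwiseClassifier] = {
    ("sys_a", "sys_b"): lambda f: 0.8 if f[0] > 10 else 0.3,
    ("sys_a", "sys_c"): lambda f: 0.6 if f[0] > 10 else 0.2,
    ("sys_b", "sys_c"): lambda f: 0.5,
}

clean_choice = select_system([20.0], systems, classifiers)
noisy_choice = select_system([5.0], systems, classifiers)
print(clean_choice, noisy_choice)
```

Because the decision uses only features computed from the audio, only the selected system needs to be run on each segment, which is where the cost and speed savings come from. Keying classifiers by system pair also means that adding a new system requires training only the pairs that involve it, leaving the existing classifiers untouched.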