A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models

Ryandhimas E. Zezario, Sabato M. Siniscalchi, Hsin-Min Wang, Yu Tsao
2024-09-16
Abstract: This work investigates two strategies for zero-shot non-intrusive speech assessment leveraging large language models. First, we explore the audio analysis capabilities of GPT-4o. Second, we propose GPT-Whisper, which uses Whisper as an audio-to-text module and evaluates the naturalness of text via targeted prompt engineering. We evaluate assessment metrics predicted by GPT-4o and GPT-Whisper, examining their correlations with human-based quality and intelligibility assessments and with the character error rate (CER) of automatic speech recognition. Experimental results show that GPT-4o alone is not effective for audio analysis, whereas GPT-Whisper demonstrates higher prediction accuracy, showing moderate correlation with speech quality and intelligibility and high correlation with CER. Compared to supervised non-intrusive neural speech assessment models, namely MOS-SSL and MTI-Net, GPT-Whisper yields a notably higher Spearman's rank correlation with the CER of Whisper. These findings validate GPT-Whisper as a reliable method for accurate zero-shot speech assessment without requiring additional training data (speech data and corresponding assessment scores).
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
This paper asks how large language models (LLMs) can be used for zero-shot, non-intrusive speech assessment, that is, without any additional training data. The researchers explore two strategies:

1. **Directly use the audio analysis capabilities of GPT-4o**
   - The researchers ask GPT-4o itself to assess speech quality and intelligibility. The experimental results show that GPT-4o alone is not effective for this task: it relies mainly on simple features of the audio signal, such as amplitude range, standard deviation, and signal-to-noise ratio (SNR), and cannot accurately assess speech quality or intelligibility.
2. **Propose the GPT-Whisper method**
   - GPT-Whisper combines the audio-to-text capability of the Whisper model with the natural language processing capability of GPT-4o. Whisper first converts the audio into text; then, through targeted prompt engineering, GPT-4o evaluates the naturalness of the transcribed text. The experimental results show that GPT-Whisper's scores correlate moderately with speech quality and intelligibility and highly with the character error rate (CER) of automatic speech recognition (ASR).

### Main contributions

- **Zero-shot assessment**: GPT-Whisper can assess speech without additional training data (speech data and corresponding assessment scores).
- **High correlation**: GPT-Whisper's predictions are compared, via Spearman's rank correlation coefficient (SRCC), with human quality scores, human intelligibility scores, and ASR CER. The correlation is strongest for CER, where the SRCC reaches 0.7784.
- **Outperforms supervised models**: Compared with existing supervised models (MOS-SSL and MTI-Net), GPT-Whisper achieves a notably higher SRCC with the Whisper CER, further supporting its effectiveness as a zero-shot speech assessment method.

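
The two evaluation quantities named in the contributions, CER and SRCC, can be illustrated with a minimal pure-Python sketch. This is not the authors' evaluation code: the function names are hypothetical, CER is computed as plain character-level edit distance divided by reference length (no text normalization is assumed), and the sample numbers are made up for illustration.

```python
from typing import Sequence


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def average_ranks(values: Sequence[float]) -> list:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Made-up numbers for illustration only; not results from the paper.
    predicted_scores = [4.5, 3.0, 2.0, 1.0]
    whisper_cer = [0.02, 0.10, 0.35, 0.80]
    print(spearman(predicted_scores, whisper_cer))  # prints -1.0
```

In practice a library routine such as `scipy.stats.spearmanr` would likely be used instead of the hand-written version; the sketch only makes the definition explicit.
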
### Experimental verification

- The researchers ran their experiments on the TMHINT-QI(S) dataset, which covers audio samples under different noise conditions and processed by different speech enhancement systems. The results show that GPT-Whisper's scores correlate moderately with human quality and intelligibility ratings and highly with the Whisper CER, where it outperforms the supervised models MOS-SSL and MTI-Net.

### Conclusion

This research shows that combining Whisper and GPT-4o yields an effective zero-shot, non-intrusive speech assessment system, one especially suited to scenarios that lack labeled data. Future research will further optimize the prompt engineering and explore the potential of GPT-Whisper in more speech processing applications.
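
The Whisper-then-GPT-4o combination summarized above can be sketched as a small two-stage function. This is a hedged illustration, not the authors' implementation: the prompt wording, the 1 to 5 scale, and the names `PROMPT_TEMPLATE`, `parse_score`, and `assess` are all assumptions, since the summary does not give the actual prompt. The ASR and LLM steps are passed in as callables so the sketch runs without any external service.

```python
import re
from typing import Callable

# Hypothetical prompt and scale: the source does not state the exact wording.
PROMPT_TEMPLATE = (
    "Rate the naturalness of the following transcript on a scale from 1 to 5. "
    "Reply with the number only.\n\nTranscript: {text}"
)


def parse_score(reply: str, low: float = 1.0, high: float = 5.0) -> float:
    """Extract the first number from the LLM reply and clamp it to [low, high]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no numeric score in reply: {reply!r}")
    return min(max(float(match.group()), low), high)


def assess(
    audio_path: str,
    transcribe: Callable[[str], str],
    ask_llm: Callable[[str], str],
) -> float:
    """Stage 1: audio -> text. Stage 2: text -> naturalness score."""
    text = transcribe(audio_path)
    reply = ask_llm(PROMPT_TEMPLATE.format(text=text))
    return parse_score(reply)


if __name__ == "__main__":
    # Stub callables stand in for Whisper and GPT-4o so the sketch is runnable.
    def fake_transcribe(path: str) -> str:
        return "the weather is nice today"

    def fake_llm(prompt: str) -> str:
        return "4"

    print(assess("sample.wav", fake_transcribe, fake_llm))  # prints 4.0
```

To use real models, `transcribe` could wrap the open-source `whisper` package (`whisper.load_model(...)` followed by `model.transcribe(path)["text"]`), and `ask_llm` could wrap a chat-completion call to GPT-4o through the OpenAI Python client; which model sizes and settings the authors used is not stated in this summary.
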