Abstract: This work investigates two strategies for zero-shot non-intrusive speech assessment leveraging large language models. First, we explore the audio analysis capabilities of GPT-4o. Second, we propose GPT-Whisper, which uses Whisper as an audio-to-text module and evaluates the naturalness of text via targeted prompt engineering. We evaluate assessment metrics predicted by GPT-4o and GPT-Whisper, examining their correlations with human-based quality and intelligibility assessments and with the character error rate (CER) of automatic speech recognition. Experimental results show that GPT-4o alone is not effective for audio analysis, whereas GPT-Whisper demonstrates higher prediction accuracy, showing moderate correlation with speech quality and intelligibility, and high correlation with CER. Compared to supervised non-intrusive neural speech assessment models, namely MOS-SSL and MTI-Net, GPT-Whisper yields a notably higher Spearman's rank correlation with the CER of Whisper. These findings validate GPT-Whisper as a reliable method for accurate zero-shot speech assessment without requiring additional training data (speech data and corresponding assessment scores).

What problem does this paper attempt to address?

The problem this paper addresses is how to use large language models (LLMs) for zero-shot, non-intrusive speech assessment without additional training data. Specifically, the researchers explored two strategies:

1. **Directly using the audio analysis capabilities of GPT-4o**: The researchers asked GPT-4o to evaluate speech quality and intelligibility directly from the audio. The experiments show that relying on GPT-4o alone is not effective: it depends mainly on simple signal features such as amplitude range, standard deviation, and signal-to-noise ratio (SNR), and cannot accurately assess speech quality or intelligibility.
2. **Proposing the GPT-Whisper method**: GPT-Whisper combines the audio-to-text capability of the Whisper model with the natural language processing capability of GPT-4o. Whisper first converts the audio into text; GPT-4o is then prompted, through targeted prompt engineering, to rate the naturalness of the transcribed text. The experiments show that GPT-Whisper correlates more strongly than GPT-4o alone with human quality and intelligibility scores, and correlates highly with the character error rate (CER) of automatic speech recognition (ASR).

### Main contributions

- **Zero-shot assessment**: GPT-Whisper can assess speech quality without additional training data.
- **High correlation**: GPT-Whisper's predictions show a relatively high Spearman's rank correlation coefficient (SRCC) with human quality and intelligibility scores and with the CER of ASR, reaching 0.7784 for CER.
- **Outperforms supervised models**: Compared with existing supervised models (such as MOS-SSL and MTI-Net), GPT-Whisper predicts the Whisper CER better, further supporting its effectiveness as a zero-shot speech assessment method.

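
The two quantities named in the contributions, CER and SRCC, can be illustrated with a minimal Python sketch. This is not the paper's evaluation code: the function names are hypothetical, the toy numbers are invented purely for illustration, and CER is computed here with a plain character-level Levenshtein distance, which is the usual definition but may differ from the paper's exact text normalization.

```python
from typing import List, Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Invented toy data: predicted scores and per-utterance CER values.
toy_scores = [4.5, 3.0, 2.0, 1.0]
toy_cers = [0.0, 0.1, 0.4, 0.9]
print(spearman(toy_scores, toy_cers))
```

The sign of the coefficient depends on how the predicted score is oriented relative to CER, so comparisons are often made on its magnitude. In practice a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written version.
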
### Experimental verification

The researchers ran their experiments on the TMHINT-QI(S) dataset, which covers audio samples under different noise conditions and speech enhancement systems. The results show that GPT-Whisper's predictions correlate moderately with human quality and intelligibility scores and highly with CER, making it a viable alternative to supervised models when labeled data is unavailable.

### Conclusion

This research shows that combining Whisper and GPT-4o yields an effective zero-shot, non-intrusive speech assessment system that is especially suited to scenarios lacking labeled data. Future research will further optimize the prompt engineering and explore the potential of GPT-Whisper in more speech processing applications.
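
The two-stage design described above (transcribe, then ask an LLM to rate the transcript) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the prompt wording, the 1-to-5 scale, and the names `PROMPT_TEMPLATE`, `parse_score`, and `assess` are all hypothetical, and the ASR and LLM calls are passed in as plain callables so that no particular Whisper or GPT-4o client API is assumed.

```python
import re
from typing import Callable

# Hypothetical prompt wording and scale; the paper's actual prompt is not
# reproduced here.
PROMPT_TEMPLATE = (
    "Rate the naturalness of the following transcript on a scale from 1 to 5. "
    "Reply with a single number.\n\nTranscript: {text}"
)


def parse_score(reply: str, low: float = 1.0, high: float = 5.0) -> float:
    """Take the first number in the LLM reply and clamp it to [low, high]."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no numeric score found in reply: {reply!r}")
    return min(high, max(low, float(match.group())))


def assess(
    audio_path: str,
    transcribe: Callable[[str], str],  # audio path -> transcript (e.g. an ASR wrapper)
    ask_llm: Callable[[str], str],     # prompt -> raw text reply (e.g. an LLM wrapper)
) -> float:
    """Zero-shot score for one utterance: transcribe, prompt, parse."""
    text = transcribe(audio_path)
    reply = ask_llm(PROMPT_TEMPLATE.format(text=text))
    return parse_score(reply)


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without any model or network access.
    fake_asr = lambda path: "the weather is nice today"
    fake_llm = lambda prompt: "Score: 4"
    print(assess("utterance.wav", fake_asr, fake_llm))
```

Injecting the two callables keeps the control flow testable offline; real wrappers around an ASR model and an LLM endpoint would be substituted for the stand-ins.
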

A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models

A Study on Incorporating Whisper for Robust Speech Assessment

Intelli-Z: Toward Intelligible Zero-Shot TTS

Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding

WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding

MTI-Net: A Multi-Target Speech Intelligibility Prediction Model

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

Multi-objective Non-intrusive Hearing-aid Speech Assessment Model

Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features

An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

Non-Intrusive Speech Quality Assessment Based on Deep Neural Networks for Speech Communication

Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Language Models

Pronunciation Assessment with Multi-modal Large Language Models

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation

Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction

Residual-Guided Non-Intrusive Speech Quality Assessment