SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics

Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, Hiroshi Saruwatari
2024-09-01
Abstract: While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for objective metrics that are highly correlated with human subjective judgments, because objective metrics are more cost-efficient. This paper proposes reference-aware automatic evaluation methods for speech generation inspired by evaluation metrics in natural language processing. The proposed SpeechBERTScore computes the BERTScore for self-supervised dense speech features of the generated and reference speech, which can have different sequential lengths. We also propose SpeechBLEU and SpeechTokenDistance, which are computed on discrete speech tokens. The evaluations on synthesized speech show that the proposed methods correlate better with human subjective ratings than mel cepstral distortion and a recent mean opinion score prediction model. They are also effective in noisy speech evaluation and have cross-lingual applicability.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the cost of evaluating speech generation. Subjective listening tests remain the gold standard but are expensive, so the authors propose automatic, reference-aware metrics that aim to track human judgments. Inspired by evaluation metrics in Natural Language Processing (NLP), they introduce three metrics that compare generated speech against reference speech:

- **SpeechBERTScore**: Computes the BERTScore on self-supervised dense speech features of the generated and reference speech, whose sequence lengths may differ.
- **SpeechBLEU**: Computes the BLEU score on discrete speech tokens of the generated and reference speech.
- **SpeechTokenDistance**: Computes a string distance, such as the Levenshtein or Jaro-Winkler distance, between the discrete speech token sequences of the generated and reference speech.

In evaluations on synthesized speech, the proposed metrics correlate better with human subjective ratings than mel cepstral distortion (MCD) and a recent mean opinion score (MOS) prediction model. They are also effective for evaluating noisy speech and are applicable across languages.
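
To make the metric definitions above more concrete, here is a minimal Python sketch rather than the authors' implementation. It assumes the self-supervised features have already been extracted into NumPy arrays of shape (frames, dims) and that the discrete speech tokens are plain integer lists. The function names `bertscore_like` and `levenshtein` are hypothetical, and reporting precision, recall, and F1 together follows the text BERTScore convention rather than anything stated in this summary.

```python
import numpy as np


def bertscore_like(gen_feats, ref_feats, eps=1e-8):
    """Greedy cosine matching between two feature sequences (BERTScore-style).

    gen_feats: array of shape (T_gen, D) for the generated speech (assumed input).
    ref_feats: array of shape (T_ref, D) for the reference speech (assumed input).
    T_gen and T_ref may differ.
    """
    # L2-normalize each frame so the dot product is a cosine similarity.
    gen = gen_feats / (np.linalg.norm(gen_feats, axis=1, keepdims=True) + eps)
    ref = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + eps)
    sim = gen @ ref.T  # shape (T_gen, T_ref)

    # Precision: each generated frame is matched to its most similar reference frame.
    precision = float(sim.max(axis=1).mean())
    # Recall: each reference frame is matched to its most similar generated frame.
    recall = float(sim.max(axis=0).mean())
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if abs(denom) > eps else 0.0
    return precision, recall, f1


def levenshtein(a, b):
    """Edit distance between two token sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x from a
                            curr[j - 1] + 1,      # insert y into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gen = rng.normal(size=(50, 16))  # stand-in for generated-speech features
    ref = rng.normal(size=(60, 16))  # stand-in for reference-speech features
    print(bertscore_like(gen, ref))
    print(levenshtein([5, 5, 7, 2], [5, 7, 7, 2, 9]))
```

The greedy maximum over the similarity matrix is what lets the two sequences differ in length, since no frame-level alignment is required. For a BLEU-style score over the same token lists, an existing n-gram BLEU implementation could be applied directly to the integer sequences.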