SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics

Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, Hiroshi Saruwatari

2024-09-01

Abstract: While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for objective metrics that correlate highly with human subjective judgments, because objective metrics are far cheaper to apply. This paper proposes reference-aware automatic evaluation methods for speech generation inspired by evaluation metrics in natural language processing. The proposed SpeechBERTScore computes the BERTScore for self-supervised dense speech features of the generated and reference speech, which can have different sequence lengths. We also propose SpeechBLEU and SpeechTokenDistance, which are computed on discrete speech tokens. Evaluations on synthesized speech show that our methods correlate better with human subjective ratings than mel cepstral distortion and a recent mean opinion score prediction model. They are also effective in noisy speech evaluation and have cross-lingual applicability.
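To make the BERTScore-on-speech-features idea concrete, here is a minimal sketch of BERTScore-style greedy matching between two feature sequences of different lengths. It is an illustration under assumptions, not the authors' implementation: the function name `bertscore_style` is hypothetical, the random arrays stand in for the output of a self-supervised speech encoder, and the abstract does not say which of precision, recall, or F1 the paper reports.

```python
import numpy as np


def bertscore_style(gen_feats, ref_feats):
    """BERTScore-style greedy matching on dense feature sequences.

    gen_feats: array of shape (T_gen, D), features of the generated speech.
    ref_feats: array of shape (T_ref, D), features of the reference speech.
    T_gen and T_ref may differ. Returns (precision, recall, f1).
    """
    # L2-normalise each frame so that dot products are cosine similarities.
    gen = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)

    # Pairwise cosine similarity matrix of shape (T_gen, T_ref).
    sim = gen @ ref.T

    # Precision: each generated frame is matched to its best reference frame.
    precision = float(sim.max(axis=1).mean())
    # Recall: each reference frame is matched to its best generated frame.
    recall = float(sim.max(axis=0).mean())

    # Harmonic mean, guarded against a zero denominator.
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom != 0 else 0.0
    return precision, recall, f1


# Stand-in features (assumption): a real setup would use encoder outputs.
rng = np.random.default_rng(0)
generated = rng.normal(size=(50, 16))   # 50 frames, 16-dim features
reference = rng.normal(size=(64, 16))   # 64 frames, 16-dim features
p, r, f = bertscore_style(generated, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Because the matching takes a maximum over frames rather than aligning them one-to-one, the two sequences never need to be the same length, which is the property the abstract highlights.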

Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses the cost of evaluating speech generation: human subjective evaluation is the gold standard but is expensive, so the authors propose automatic metrics that correlate well with human ratings. Inspired by evaluation metrics in Natural Language Processing (NLP), they propose three metrics that compare generated speech against a reference recording:

- **SpeechBERTScore**: Computes the BERTScore between self-supervised dense speech features of the generated and reference speech, which may differ in sequence length.
- **SpeechBLEU**: Computes the BLEU score on discrete speech tokens of the generated and reference speech.
- **SpeechTokenDistance**: Measures the similarity of the generated and reference discrete speech token sequences using a string distance such as the Levenshtein or Jaro-Winkler distance.

On synthesized speech, these metrics correlate better with human subjective ratings than mel cepstral distortion (MCD) and a recent mean opinion score prediction model. They are also effective for evaluating noisy speech and are applicable across languages.
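The two token-based metrics can be illustrated with a short sketch that operates on sequences of integer token IDs. This is a simplified illustration under assumptions, not the paper's code: the token IDs below are made up, `levenshtein` and `ngram_precision` are hypothetical helper names, and `ngram_precision` is only the clipped n-gram precision that BLEU is built from, without BLEU's brevity penalty or its averaging over n-gram orders.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two token sequences (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ngram_precision(gen, ref, n):
    """Clipped n-gram precision of gen against ref (a building block of BLEU)."""
    gen_ngrams = Counter(tuple(gen[i:i + n]) for i in range(len(gen) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(gen_ngrams.values())
    if total == 0:
        return 0.0
    # Each generated n-gram counts at most as often as it occurs in ref.
    clipped = sum(min(count, ref_ngrams[gram]) for gram, count in gen_ngrams.items())
    return clipped / total


# Made-up discrete speech tokens (assumption): real tokens would come from
# quantising speech features, for example with a clustering model.
generated_tokens = [12, 12, 7, 45, 3]
reference_tokens = [12, 7, 45, 45, 3]

print("edit distance:", levenshtein(generated_tokens, reference_tokens))
print("bigram precision:", ngram_precision(generated_tokens, reference_tokens, 2))
```

A lower edit distance and a higher n-gram precision both indicate that the generated token sequence is closer to the reference; how the paper normalises or combines these quantities is not specified in this summary.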
