Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features

Shaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi, Takashi Tsuboi, Yasuhiro Tanaka, Daisuke Nakatsubo, Satoshi Maesawa, Ryuta Saito, Masahisa Katsuno, Hiroaki Kudo
2024-08-22
Abstract: The potential of deep learning in clinical speech processing is immense, yet the hurdles of limited and imbalanced clinical data samples loom large. This article addresses these challenges by showcasing the utilization of automatic speech recognition and self-supervised learning representations, pre-trained on extensive datasets of normal speech. This innovative approach aims to estimate voice quality of patients with impaired vocal systems. Experiments involve checks on PVQD dataset, covering various causes of vocal system damage in English, and a Japanese dataset focusing on patients with Parkinson's disease before and after undergoing subthalamic nucleus deep brain stimulation (STN-DBS) surgery. The results on PVQD reveal a notable correlation (>0.8 on PCC) and an extraordinary accuracy (<0.5 on MSE) in predicting Grade, Breathy, and Asthenic indicators. Meanwhile, progress has been achieved in predicting the voice quality of patients in the context of STN-DBS.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
This paper sets out to develop a method for assessing the voice quality of patients with impaired vocal systems. Specifically, it uses automatic speech recognition (ASR) representations together with other features to estimate patients' voice quality, with the aim of overcoming the limitations of traditional auditory-perceptual judgment.

### Main problems and challenges

1. **Limited and imbalanced clinical data samples**: Deep-learning models need large amounts of data to learn robust features, but clinical voice data are often limited and imbalanced.
2. **Limitations of subjective evaluation**: Traditional auditory-perceptual evaluation relies on experienced doctors or speech pathologists rating sustained vowels and continuous speech. This has the following drawbacks:
   - It requires evaluators with extensive clinical experience.
   - Reliable ratings usually require several evaluators.
   - The evaluation cycle is long, which delays the results reaching doctors.
3. **Lack of objective evaluation means**: Existing objective evaluation methods mainly focus on sustained vowels, and the amount of such vowel data is too limited to train deep-learning models effectively.

### Solutions

To address these problems, the research proposes evaluating patients' voice quality by combining the following features:

- **Automatic speech recognition (ASR) representation**: Voice features extracted with a pre-trained ASR model (such as Whisper).
- **Self-supervised learning (SSL) representation**: Voice features extracted with a self-supervised pre-trained model such as HuBERT.
- **Mel-spectrogram**: Captures the frequency characteristics of the audio signal.

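
As a rough illustration of what combining these three feature streams can look like, the sketch below mean-pools each frame-level feature sequence over time and concatenates the results into one utterance-level vector. This is an assumption made for illustration only: the summary does not describe the paper's actual fusion mechanism, and the function names `mean_pool` and `fuse_features`, as well as the toy feature values, are hypothetical.

```python
from statistics import fmean


def mean_pool(frames):
    """Average a list of equal-length frame vectors over the time axis."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [fmean(frame[d] for frame in frames) for d in range(dim)]


def fuse_features(asr_frames, ssl_frames, mel_frames):
    """Concatenate time-pooled ASR, SSL, and Mel-spectrogram features.

    Hypothetical fusion scheme: each stream may have its own frame rate
    and dimensionality, so each is pooled separately before joining.
    """
    return mean_pool(asr_frames) + mean_pool(ssl_frames) + mean_pool(mel_frames)


# Toy stand-ins for real model outputs (made-up numbers).
asr = [[1.0, 3.0], [3.0, 5.0]]   # 2 frames x 2 dims
ssl = [[0.0], [2.0], [4.0]]      # 3 frames x 1 dim
mel = [[1.0, 1.0, 1.0]]          # 1 frame  x 3 dims

fused = fuse_features(asr, ssl, mel)
print(len(fused))  # 6
```

In practice the three inputs would come from a Whisper encoder, a HuBERT model, and a Mel-spectrogram front end, and the fused vector would feed a regression head that predicts the voice quality ratings.
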
The method handles continuous speech as well as sustained vowels, which improves the accuracy and robustness of the evaluation. On the PVQD dataset, predictions of the Grade, Breathy, and Asthenic indicators show a strong correlation with the reference ratings and a low error (PCC > 0.8, MSE < 0.5). The research also examines changes in the voice quality of patients with Parkinson's disease before and after subthalamic nucleus deep brain stimulation (STN-DBS) surgery, where it reports progress in prediction.

### Conclusion

By introducing ASR, SSL, and Mel-spectrogram features, this research provides a more objective and efficient method for clinical voice quality evaluation. According to the experimental results, the method outperforms traditional methods on multiple indicators and has potential for use in clinical practice. In particular, it shows preliminary predictive ability when evaluating the postoperative voice quality of patients with Parkinson's disease.
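
For reference, the two metrics quoted above, the Pearson correlation coefficient (PCC) and the mean squared error (MSE), can be computed as in the following minimal sketch. The ratings used here are made-up numbers for illustration, not data from the paper, and the function names are hypothetical.

```python
import math


def mse(pred, target):
    """Mean squared error between predicted and reference ratings."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pcc(pred, target):
    """Pearson correlation coefficient between two rating lists."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# Made-up predicted and clinician ratings for three recordings.
predicted = [1.0, 2.0, 3.0]
reference = [1.5, 2.5, 3.5]

print(mse(predicted, reference))  # 0.25
print(pcc(predicted, reference))  # 1.0
```

The toy example shows why both metrics are reported together: a constant offset leaves the PCC at 1.0 while the MSE still registers the error.
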