Abstract: Objective: Speech is the primary medium of communication, and it also carries valuable information about a speaker's health, emotions, and identity. Various conditions can affect the vocal organs, leading to speech difficulties. Voice clinicians and academic researchers have studied speech analysis extensively. Previous approaches primarily focused on a single task, such as differentiating between normal and dysphonic speech, classifying different voice disorders, or estimating the severity of voice disorders. Methods and procedures: This study proposes an approach that combines transfer learning and multitask learning (MTL) to simultaneously perform dysphonia classification and severity estimation. Both tasks use a shared representation, and the network is learned from these shared features. We employed five computer vision models and modified their architectures to support multitask learning. Additionally, we conducted binary 'healthy vs. dysphonia' and multiclass 'healthy vs. organic and functional dysphonia' classification using multitask learning, with the speaker's sex as an auxiliary task. Results: The proposed method achieved improved performance across all classification metrics compared to single-task learning (STL), which performs only classification or severity estimation. Specifically, the model achieved F1 scores of 93% and 90% in MTL and STL, respectively. Moreover, we observed considerable improvements in both classification tasks when evaluating different beta values for the weight assigned to the sex-predicting auxiliary task. MTL achieved an accuracy of 77% compared to the STL score of 73.2%. However, the performance of severity estimation in MTL was comparable to STL. Conclusion: Our goal is to improve how voice pathologists and clinicians understand patients' conditions, make it easier to track their progress, and enhance the monitoring of vocal quality and treatment procedures.
Clinical and Translational Impact Statement: By integrating classification and severity estimation of dysphonia through multitask learning, we aim to help clinicians better understand a patient's condition and effectively monitor their progress and voice quality.
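
The abstract mentions a beta value that weights the sex-predicting auxiliary task but does not state the loss formula. As a minimal sketch, assuming the common weighted-sum form (main loss plus beta times auxiliary loss), the idea might look like the following. The function names, the example probabilities, and beta = 0.3 are all hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class under predicted probabilities."""
    return -math.log(probs[label])


def multitask_loss(main_loss, aux_loss, beta):
    """Assumed weighted sum: main task loss plus beta times the auxiliary loss."""
    return main_loss + beta * aux_loss


# Hypothetical outputs of two task heads that share one feature extractor.
class_probs = [0.2, 0.8]  # main task: healthy vs. dysphonia
sex_probs = [0.6, 0.4]    # auxiliary task: speaker's sex

main = cross_entropy(class_probs, label=1)
aux = cross_entropy(sex_probs, label=0)

# beta = 0 reduces to single-task learning on the main task;
# larger beta gives the auxiliary task more influence on the shared features.
total = multitask_loss(main, aux, beta=0.3)
stl_only = multitask_loss(main, aux, beta=0.0)
```

Under this assumed form, sweeping beta over a small grid and comparing validation metrics would be one way to study how much the auxiliary task helps.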

Hierarchical Multitask Learning for CTC-based Speech Recognition

Hierarchical and Bidirectional Joint Multi-Task Classifiers for Natural Language Understanding

Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR

Advancing Multi-Accented LSTM-CTC Speech Recognition using a Domain Specific Student-Teacher Learning Paradigm

Speaker Adaptation for End-to-End CTC Models.

Collaborative Joint Training With Multitask Recurrent Model for Speech and Speaker Recognition.

Acoustic Modeling with DFSMN-CTC and Joint CTC-CE Learning

Residual Convolutional CTC Networks for Automatic Speech Recognition.

Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription

Enhancing CTC-based speech recognition with diverse modeling units

BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task

An improved hybrid CTC-Attention model for speech recognition

Advancing Acoustic-to-Word CTC Model

Multitask and Transfer Learning Approach for Joint Classification and Severity Estimation of Dysphonia

Improving CTC-AED model with integrated-CTC and auxiliary loss regularization

Hierarchical Multilabel Text Classification Via Multitask Learning.

Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC

CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

Multi-Task Learning for Front-End Text Processing in TTS

CR-CTC: Consistency regularization on CTC for improved speech recognition