Explainable Stuttering Recognition Using Axial Attention.

Yu Ma, Yuting Huang, Kaixiang Yuan, Guangzhe Xuan, Yongzi Yu, Hengrui Zhong, Rui Li, Jian Shen, Kun Qian, Bin Hu, Bjorn W. Schuller, Yoshiharu Yamamoto
DOI: https://doi.org/10.1007/978-981-99-4749-2_18
2023-01-01
Abstract: Stuttering is a complex speech disorder that disrupts the flow of speech, and recognizing persons who stutter (PWS) and understanding their significant struggles is crucial. With advancements in computer vision, deep neural networks offer potential for recognizing stuttering events through image-based features. In this paper, we extract Wavelet Transformation (WT) and Histogram of Oriented Gradients (HOG) image features from audio signals. We also generate explainable images using Gradient-weighted Class Activation Mapping (Grad-CAM) as input for our final recognition model, an axial attention-based EfficientNetV2 trained on the Kassel State of Fluency Dataset (KSoF) to perform 8-class recognition. Our experiments achieve a relative increase in unweighted average recall (UAR) of 4.4% over the ComParE 2022 baseline, demonstrating that the axial attention-based EfficientNetV2, combined with the explainable input, is capable of detecting and recognizing multiple types of stuttering.
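The abstract reports its result as a relative increase in UAR. As a minimal sketch of what that metric measures, the snippet below computes UAR as the mean of per-class recalls and a relative percentage increase between two scores. The function names, the toy labels, and the example scores are hypothetical illustrations; they are not taken from the paper or from KSoF, and this is not the authors' evaluation code.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls: each class counts equally, whatever its size."""
    total = defaultdict(int)    # number of true samples per class
    correct = defaultdict(int)  # number of correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def relative_increase_percent(new, baseline):
    """Relative (not absolute) change of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Toy, hypothetical labels (two classes only, for illustration).
y_true = ["block", "block", "block", "block", "filler", "filler"]
y_pred = ["block", "block", "block", "filler", "filler", "block"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

Because every class contributes equally, UAR differs from plain accuracy whenever the classes are imbalanced, which is why it suits multi-class tasks where some event types are rare.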