Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora

Yuanchao Li, Yumnah Mohamied, Peter Bell, Catherine Lai
DOI: https://doi.org/10.48550/arXiv.2210.02595
2022-12-13
Abstract: Self-supervised speech models have grown fast during the past few years and have proven feasible for use in various downstream tasks. Some recent work has started to look at the characteristics of these models, yet many concerns have not been fully addressed. In this work, we conduct a study on emotional corpora to explore a popular self-supervised model -- wav2vec 2.0. Via a set of quantitative analyses, we mainly demonstrate that: 1) wav2vec 2.0 appears to discard paralinguistic information that is less useful for word recognition purposes; 2) for emotion recognition, representations from the middle layer alone perform as well as those derived from layer averaging, while the final layer results in the worst performance in some cases; 3) current self-supervised models may not be the optimal solution for downstream tasks that make use of non-lexical features. Our work provides novel findings that will aid future research in this area and a theoretical basis for the use of existing models.
Audio and Speech Processing, Computation and Language, Sound
What problem does this paper attempt to address?
This paper examines how well a self-supervised speech model, wav2vec 2.0, represents emotional speech and how useful its representations are for emotion recognition. Through a series of quantitative analyses on emotional corpora, the authors report three main findings:

1. **Handling of paralinguistic information by wav2vec 2.0**: The model appears to discard paralinguistic information that is less useful for word recognition. As a result, its representations may lose cues that matter for emotion recognition.
2. **Performance differences across layers**: For emotion recognition, representations from the middle layer alone perform as well as those obtained by averaging all layers, while the final layer performs worst in some cases. The final layer is therefore not necessarily the best choice for this task.
3. **Suitability of self-supervised models for tasks that rely on non-lexical features**: Current self-supervised models may not be the optimal solution for downstream tasks that depend on non-lexical features.

The authors present these findings as an aid to future research and as a theoretical basis for using existing self-supervised models in tasks such as speech emotion recognition.
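
The layer comparison described above (middle layer, final layer, and layer averaging) can be illustrated with a minimal sketch. It is not the authors' code. It assumes the per-layer hidden states of a model such as wav2vec 2.0 have already been extracted as nested lists, that frames are mean-pooled over time into one utterance-level vector, and that "middle" means the layer at index `len(layers) // 2`. The names `mean_pool` and `pooling_variants` are hypothetical, and the toy values are placeholders rather than real features.

```python
from statistics import fmean
from typing import Dict, List

Frame = List[float]          # one feature vector per time step
Layer = List[Frame]          # all frames produced by one layer
HiddenStates = List[Layer]   # one entry per layer of the model


def mean_pool(frames: Layer) -> Frame:
    """Average over time to get one utterance-level vector (assumed pooling)."""
    return [fmean(dim) for dim in zip(*frames)]


def pooling_variants(hidden_states: HiddenStates) -> Dict[str, Frame]:
    """Build utterance embeddings for the three layer choices being compared."""
    per_layer = [mean_pool(layer) for layer in hidden_states]
    # Assumption: the "middle" layer is the one at the central index.
    middle = per_layer[len(per_layer) // 2]
    final = per_layer[-1]
    # Layer averaging: unweighted mean of the per-layer utterance vectors.
    averaged = [fmean(dim) for dim in zip(*per_layer)]
    return {"middle": middle, "final": final, "layer_average": averaged}


# Toy input: 3 layers, 2 frames, 2 dimensions (placeholder values only).
toy = [
    [[0.0, 2.0], [2.0, 4.0]],
    [[1.0, 1.0], [3.0, 3.0]],
    [[4.0, 0.0], [6.0, 2.0]],
]
embeddings = pooling_variants(toy)
```

In an experiment of this kind, each of the three embeddings would be passed to the same downstream emotion classifier so that any performance difference can be attributed to the choice of layer.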