Probing self-attention in self-supervised speech models for cross-linguistic differences

Sai Gopinath, Joselyn Rodriguez
2024-09-05
Abstract: Speech models have gained traction thanks to increases in accuracy from novel transformer architectures. While this impressive increase in performance across automatic speech recognition (ASR) benchmarks is noteworthy, there is still much that is unknown about the use of attention mechanisms for speech-related tasks. For example, while it is assumed that these models are learning language-independent (i.e., universal) speech representations, there has not yet been an in-depth exploration of what it would mean for the models to be language-independent. In the current paper, we explore this question within the realm of self-attention mechanisms of one small self-supervised speech transformer model (TERA). We find that even with a small model, the attention heads learned are diverse, ranging from almost entirely diagonal to almost entirely global, regardless of the training language. We highlight some notable differences in attention patterns between Turkish and English and demonstrate that the models do learn important phonological information during pretraining. We also present a head ablation study which shows that models across languages primarily rely on diagonal heads to classify phonemes.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper investigates how the self-attention mechanisms of self-supervised speech models behave across languages. Specifically, the researchers ask whether these models learn language-independent (i.e., universal) speech representations, and whether their self-attention patterns differ noticeably depending on the language being processed. By analyzing the self-attention heads of a small self-supervised speech Transformer model (TERA), they examine four points:

1. **Language independence**: Whether the models really learn language-independent speech representations, or whether what they learn is shaped by the training language.
2. **Attention patterns**: Whether the attention patterns learned for different languages (such as Turkish and English) differ in notable ways.
3. **Speech feature learning**: Whether the models do in fact learn important phonological information during pre-training.
4. **Importance of attention heads**: How much each type of attention head (global, vertical, diagonal) contributes to phoneme classification, measured through a head ablation study.

Through these analyses, the paper aims to give a deeper understanding of how self-supervised speech models behave when processing different languages, which could in turn inform model architecture choices and improve performance on multilingual tasks.
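To make the three head types concrete, here is a minimal sketch of how a single attention map might be scored as global, vertical, or diagonal. The formulas, the function names `head_scores` and `label_head`, and the "take the largest score" labeling rule are illustrative assumptions, not the metrics or procedure used in the paper.

```python
import numpy as np


def head_scores(attn: np.ndarray) -> dict:
    """Score one attention map of shape (T, T), T > 1, whose rows sum to 1.

    All three scores are illustrative assumptions, each scaled to at most 1.
    """
    T = attn.shape[0]
    idx = np.arange(T)

    # Globalness: mean row entropy, normalised so uniform attention scores 1.
    # Clipping avoids log(0); zero-weight entries still contribute 0.
    entropy = -(attn * np.log(np.clip(attn, 1e-12, None))).sum(axis=1)
    globalness = float(entropy.mean() / np.log(T))

    # Verticality: largest average attention mass received by any one key frame.
    verticality = float(attn.mean(axis=0).max())

    # Diagonality: 1 minus the attention-weighted query-key distance,
    # normalised by the largest possible distance (T - 1).
    distance = np.abs(idx[:, None] - idx[None, :])
    diagonality = float(1.0 - (attn * distance).sum(axis=1).mean() / (T - 1))

    return {
        "global": globalness,
        "vertical": verticality,
        "diagonal": diagonality,
    }


def label_head(attn: np.ndarray) -> str:
    """Crude heuristic: label the head by its largest score."""
    scores = head_scores(attn)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    T = 8
    diagonal_map = np.eye(T)              # each frame attends only to itself
    global_map = np.full((T, T), 1.0 / T)  # each frame attends to all frames equally
    vertical_map = np.zeros((T, T))
    vertical_map[:, 3] = 1.0              # every frame attends to frame 3

    for name, attn in [("diagonal_map", diagonal_map),
                       ("global_map", global_map),
                       ("vertical_map", vertical_map)]:
        print(name, label_head(attn), head_scores(attn))
```

On these idealized maps the heuristic recovers the intended label in each case. Real attention maps are mixtures of the three patterns, so in practice one would likely rank heads by each score separately rather than rely on a single hard label.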