Abstract: Deep learning architectures have driven significant performance gains in many research areas. Automatic speech recognition (ASR) has benefited from these scientific and technological advances, particularly in acoustic modeling, which now relies on deep neural network architectures. However, these gains have come with increased complexity regarding the information learned and conveyed by these black-box architectures. Following much prior work on neural network interpretability, we propose in this article a protocol that aims to determine which information is encoded in an ASR acoustic model (AM), and where. To do so, we evaluate AM performance on a fixed set of tasks using intermediate representations (here, taken at different layer levels). From the variation in performance across the targeted tasks, we can formulate hypotheses about which information is enhanced or perturbed at different stages of the architecture. Experiments are performed on speaker verification, acoustic environment classification, gender classification, tempo-distortion detection, and speech sentiment/emotion identification. The analysis shows that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition, such as emotion, sentiment, or speaker identity. Overall, the low-level hidden layers appear useful for structuring information, while the upper ones tend to discard information that is useless for phoneme recognition.
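The layer-wise evaluation described in the abstract can be illustrated with a minimal probing sketch. This is not the authors' implementation: it assumes a simple linear probe (ridge-regularised least squares on one-hot targets) and uses synthetic arrays in place of real acoustic-model activations. All names (`fit_linear_probe`, `probe_layers`, `low_layer`, `high_layer`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def fit_linear_probe(X, y, n_classes, ridge=1e-3):
    """Fit a ridge-regularised least-squares probe on one-hot targets."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    Y = np.eye(n_classes)[y]                        # one-hot encode labels
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)             # weights: (dim + 1, n_classes)


def probe_accuracy(W, X, y):
    """Classification accuracy of a fitted probe on held-out features."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(np.argmax(Xb @ W, axis=1) == y))


def probe_layers(train_feats, y_train, test_feats, y_test, n_classes):
    """Train one probe per layer and return {layer_name: test accuracy}."""
    return {
        name: probe_accuracy(
            fit_linear_probe(train_feats[name], y_train, n_classes),
            test_feats[name],
            y_test,
        )
        for name in train_feats
    }


def make_synthetic_layers(seed, n, dim=8):
    """Synthetic stand-in for layer activations (assumption, not real AM data).

    'low_layer' carries the binary label in its first dimension;
    'high_layer' is pure noise, mimicking a layer that discarded it.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    low = rng.normal(size=(n, dim))
    low[:, 0] += 4.0 * y
    high = rng.normal(size=(n, dim))
    return {"low_layer": low, "high_layer": high}, y


train_feats, y_train = make_synthetic_layers(seed=0, n=400)
test_feats, y_test = make_synthetic_layers(seed=1, n=200)
results = probe_layers(train_feats, y_train, test_feats, y_test, n_classes=2)
for layer, acc in results.items():
    print(f"{layer}: probe accuracy = {acc:.3f}")
```

In a real setting, the synthetic arrays would be replaced by pooled hidden activations extracted from each layer of the acoustic model, and the binary label by the target of each probing task (for example speaker identity or emotion). A drop in probe accuracy from one layer to the next would then suggest that the corresponding information is being suppressed at that stage.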

On Architectures and Training for Raw Waveform Feature Extraction in ASR

Comparative Analysis of the wav2vec 2.0 Feature Extractor

On Modular Training of Neural Acoustics-to-Word Model for LVCSR

wav2vec and its current potential to Automatic Speech Recognition in German for the usage in Digital History: A comparative assessment of available ASR-technologies for the use in cultural heritage contexts

Multi-Span Acoustic Modelling Using Raw Waveform Signals

Self-Supervised Learning for Multi-Channel Neural Transducer

Efficient Utilization of Large Pre-Trained Models for Low Resource ASR

Analyzing And Improving Neural Speaker Embeddings for ASR

Exploring ASR-Based Wav2Vec2 for Automated Speech Disorder Assessment: Insights and Analysis

Unsupervised Speech Representation Learning Using WaveNet Autoencoders

Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems

Wav2vec‐MoE: an Unsupervised Pre‐training and Adaptation Method for Multi‐accent ASR

Modular End-to-end Automatic Speech Recognition Framework for Acoustic-to-word Model

On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures

Efficient infusion of self-supervised representations in Automatic Speech Recognition

Improving Automatic Speech Recognition for Non-Native English with Transfer Learning and Language Model Decoding

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Learning Architectures from an Extended Search Space for Language Modeling

Towards Unsupervised Automatic Speech Recognition Trained by Unaligned Speech and Text only

Improving Non-Autoregressive End-to-End Speech Recognition with Pre-Trained Acoustic and Language Models