Abstract: Spoken Language Understanding (SLU) is a core task in most human-machine interaction systems. With the emergence of smart homes, smartphones and smart speakers, SLU has become a key technology for the industry. In a classical SLU approach, an Automatic Speech Recognition (ASR) module transcribes the speech signal into a textual representation from which a Natural Language Understanding (NLU) module extracts semantic information. Recently, End-to-End SLU (E2E SLU) based on Deep Neural Networks has gained momentum since it benefits from the joint optimization of the ASR and NLU parts, hence limiting the error cascading effect of the pipeline architecture. However, little is known about the actual linguistic properties used by E2E models to predict concepts and intents from speech input. In this paper, we present a study identifying the signal features and other linguistic properties used by an E2E model to perform the SLU task. The study is carried out in the application domain of a smart home that has to handle non-English (here French) voice commands. The results show that good E2E SLU performance does not always require perfect ASR capability. Furthermore, the results show that the E2E model handles background noise and syntactic variation better than the pipeline model. Finally, a finer-grained analysis suggests that the E2E model uses the pitch information of the input signal to identify voice command concepts. The results and methodology outlined in this paper provide a springboard for further analyses of E2E models in speech processing.

On the Use of Semantically-Aligned Speech Representations for Spoken Language Understanding

Semantic enrichment towards efficient speech representations

SAMU-XLSR: Semantically-Aligned Multimodal Utterance-Level Cross-Lingual Speech Representation

A dual task learning approach to fine-tune a multilingual semantic speech encoder for Spoken Language Understanding

Generating More Audios for End-to-End Spoken Language Understanding

Understanding Semantics from Speech Through Pre-training

The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation

Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks

End-to-End Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining

On joint training with interfaces for spoken language understanding

Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding

Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding

Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages

WaBERT: A Low-resource End-to-end Model for Spoken Language Understanding and Speech-to-BERT Alignment

Semi-Supervised Spoken Language Understanding Via Self-Supervised Speech and Language Model Pretraining

End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting

Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-to-Sequence End-to-End Spoken Language Understanding

DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding

Introducing Semantics into Speech Encoders

On the Evaluation of Speech Foundation Models for Spoken Language Understanding

Bidirectional Representations for Low Resource Spoken Language Understanding