Abstract: We propose a space-and-speaker-aware (SSA) approach to acoustic modeling (AM), denoted as SSA-AM, to improve the performance of automatic speech recognition (ASR) in distant multi-array conversational scenarios. In contrast to conventional AM, which uses only spectral features from a target speaker as inputs, the inputs to SSA-AM consist of speech features from both the target and interfering speakers. These features carry discriminative information about the different speakers, including spatial information embedded in interaural phase differences (IPDs) between individual interfering speakers and the target speaker. Within the proposed SSA-AM framework, we explore four acoustic model architectures built from different combinations of four neural networks: a deep residual network, a factorized time delay neural network, self-attention, and a residual bidirectional long short-term memory network. Various data augmentation techniques are adopted to expand the training data with different options of beamformed speech obtained from multi-channel speech enhancement. Evaluated on the recent CHiME-6 Challenge Track 1, the proposed SSA-AM framework achieves consistent recognition improvements over the official baseline acoustic models. Furthermore, SSA-AM outperforms acoustic models that do not explicitly use the space and speaker information. Finally, our data augmentation schemes are shown to be especially effective for compact model designs. Code is released at https://github.com/coalboss/SSA_AM.
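
To make the notion of an IPD feature concrete, the following is a minimal sketch of how a phase difference between two microphone channels is commonly computed from short-time spectra. It is not taken from the released code: the function names `stft` and `ipd_features`, the frame length, the hop size, and the cos/sin encoding are all assumptions made for illustration, and the abstract does not specify how SSA-AM pairs signals or encodes the phase.

```python
import numpy as np


def stft(x, frame_len=512, hop=256):
    """Naive STFT (hypothetical helper): Hann-windowed frames, one-sided FFT.

    Returns a complex array of shape (n_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def ipd_features(ch_a, ch_b, frame_len=512, hop=256):
    """Phase difference between two channels per time-frequency bin.

    Returned as (cos, sin) of the difference, an assumed encoding that
    avoids the discontinuity of the raw angle at +/- pi.
    """
    spec_a = stft(ch_a, frame_len, hop)
    spec_b = stft(ch_b, frame_len, hop)
    # The angle of the cross-spectrum is phase(a) - phase(b), already wrapped.
    ipd = np.angle(spec_a * np.conj(spec_b))
    return np.cos(ipd), np.sin(ipd)


if __name__ == "__main__":
    fs = 16000                      # assumed sampling rate in Hz
    t = np.arange(fs)               # one second of samples
    f = 1000.0                      # tone frequency, falls exactly on bin 32
    delay = 4                       # channel b lags channel a by 4 samples
    ch_a = np.sin(2 * np.pi * f * t / fs)
    ch_b = np.sin(2 * np.pi * f * (t - delay) / fs)
    cos_ipd, sin_ipd = ipd_features(ch_a, ch_b)
    print(cos_ipd.shape, cos_ipd[0, 32], sin_ipd[0, 32])
```

In this toy setup a 4-sample delay of a 1 kHz tone at 16 kHz corresponds to a quarter-period shift, so the phase difference at the tone's frequency bin is close to π/2. Real array recordings would need reverberation and multiple speakers to be handled, which this sketch does not attempt.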

Adaptation of Hierarchical Structured Models for Speech Act Recognition in Asynchronous Conversation

Dialogue Act Recognition Via CRF-Attentive Structured Network

Effective Cross-Utterance Language Modeling for Conversational Speech Recognition

Dialogue Act Sequence Labeling using Hierarchical encoder with CRF

Supervised Adversarial Contrastive Learning for Emotion Recognition in Conversations

Exploring Textual and Speech information in Dialogue Act Classification with Speaker Domain Adaptation

Progressive unsupervised domain adaptation for ASR using ensemble models and multi-stage training

Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

SI-LSTM: Speaker Hybrid Long-short Term Memory and Cross Modal Attention for Emotion Recognition in Conversation

Topic Segmentation and Labeling in Asynchronous Conversations

Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech

Speech Activity Detection Based on Multilingual Speech Recognition System

Domain Adaptation with Augmented Data by Deep Neural Network Based Method Using Re-Recorded Speech for Automatic Speech Recognition in Real Environment

Space-and-speaker-aware Acoustic Modeling with Effective Data Augmentation for Recognition of Multi-Array Conversational Speech

Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

Can Large Language Models Aid in Annotating Speech Emotional Data? Uncovering New Frontiers

Emotion Recognition in Conversation using Probabilistic Soft Logic

DARER: Dual-task Temporal Relational Recurrent Reasoning Network for Joint Dialog Sentiment Classification and Act Recognition

ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and Multi-Task Language Modeling

Learning Domain Specific Language Models for Automatic Speech Recognition through Machine Translation

Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition