Cross-Modal Semantic Alignment before Fusion for Two-Pass End-to-End Spoken Language Understanding

Ling Huang, Q. Hong, Lin Li, Haodong Zhou, Tao Li
DOI: https://doi.org/10.21437/interspeech.2023-758
2023-08-20
Abstract: The deliberation-based two-pass model, which combines semantic and acoustic information, can effectively improve the performance of end-to-end (E2E) spoken language understanding (SLU). However, existing two-pass models usually fuse speech and text embeddings directly, without accounting for the inherent differences between the two modalities. We propose a novel approach named Cross-modal Semantic Alignment before Fusion (CSAF), which adopts a contrastive loss to align speech and text embeddings before fusing them. We introduce a shared semantic memory transformer to project the embeddings from the two modalities into a common semantic space, and a multi-modal gated network to generate the fused embeddings. We conduct experiments on the FSC Challenge test set and the SLURP dataset. The results demonstrate that our method significantly improves intent classification accuracy, achieving an absolute improvement of 3.1% over previous work on the FSC Challenge Utterance Set.
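The align-then-fuse idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the contrastive loss or of the gated network, so the code below assumes a symmetric InfoNCE-style loss over a batch of paired speech/text embeddings and a simple element-wise sigmoid gate. The function names, the `temperature` value, and the gate parameters are all hypothetical, and the shared semantic memory transformer is omitted entirely.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax_row(row):
    # Numerically stable log-softmax over one row of similarity scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_alignment_loss(speech, text, temperature=0.1):
    """Assumed symmetric InfoNCE-style loss (not confirmed by the source).

    speech[i] and text[i] are taken to be embeddings of the same utterance;
    every other pairing in the batch serves as a negative.
    """
    s = [l2_normalize(v) for v in speech]
    t = [l2_normalize(v) for v in text]
    n = len(s)
    # sim[i][j]: scaled cosine similarity between speech i and text j.
    sim = [[dot(s[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-speech direction
    loss_s2t = -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n
    loss_t2s = -sum(log_softmax_row(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_s2t + loss_t2s)

def gated_fusion(speech_vec, text_vec, gate_w, gate_b):
    """Hypothetical element-wise gate: g = sigmoid(w_s*s + w_t*t + b),
    fused = g*s + (1-g)*t. The paper's gated network may differ."""
    fused = []
    for s, t, (w_s, w_t), b in zip(speech_vec, text_vec, gate_w, gate_b):
        g = 1.0 / (1.0 + math.exp(-(w_s * s + w_t * t + b)))
        fused.append(g * s + (1.0 - g) * t)
    return fused
```

Under these assumptions, a batch whose speech and text embeddings already agree yields a loss near zero, while a batch with swapped pairings is penalized heavily. Minimizing the loss therefore pulls matching pairs together in the shared space before the gate mixes them.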
Computer Science