Improving Spoken Language Understanding with Cross-Modal Contrastive Learning
Jingjing Dong, Jiayi Fu, Peng Zhou, Hao Li, Xiaorui Wang
DOI: https://doi.org/10.21437/interspeech.2022-658
2022-01-01
Abstract: Spoken language understanding (SLU) is conventionally based on a pipeline architecture, which suffers from error propagation. To mitigate this problem, end-to-end (E2E) models have been proposed that map speech input directly to the desired semantic outputs. Other work leverages linguistic information in addition to acoustic information by adopting a multi-modal architecture. In this work, we propose a novel multi-modal SLU method, named CMCL, which uses cross-modal contrastive learning to learn better multi-modal representations. In particular, we design a two-stream multi-modal framework and perform a contrastive learning task across speech and text representations. Moreover, CMCL combines a multi-modal shared classification task with the contrastive learning task to guide the learned representations toward better performance on intent classification. We also investigate the efficacy of employing cross-modal contrastive learning during pretraining. CMCL achieves 99.69% and 92.50% accuracy on the FSC and Smartlights datasets, respectively, outperforming state-of-the-art comparative methods. In addition, accuracy decreases by only 0.32% and 2.8% when the model is trained on 10% and 1% of the FSC dataset, respectively, indicating its advantage in few-shot scenarios.
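To make the idea of a contrastive task across speech and text representations concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss over paired utterance embeddings. The abstract does not specify the exact loss, so the formulation, the function names (`l2_normalize`, `cross_entropy_diagonal`, `cross_modal_contrastive_loss`), and the `temperature` value are all illustrative assumptions rather than the authors' implementation. It assumes that row i of the speech batch and row i of the text batch come from the same utterance.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def cross_modal_contrastive_loss(speech_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    speech_emb[i] and text_emb[i] are treated as a positive pair; every
    other pairing in the batch acts as a negative.
    """
    s = [l2_normalize(v) for v in speech_emb]
    t = [l2_normalize(v) for v in text_emb]
    # Cosine similarity matrix scaled by the temperature: rows index speech,
    # columns index text.
    logits = [
        [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        for si in s
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-speech direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    speech = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print(cross_modal_contrastive_loss(speech, aligned_text))
    print(cross_modal_contrastive_loss(speech, swapped_text))
```

In this sketch the loss is small when each speech embedding is closest to its own transcript embedding and large when the pairing is swapped. In a setup like the one described, such a term would presumably be added to the intent classification loss, although the way the two are weighted is not given in the abstract.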