SAMU-XLSR: Semantically-Aligned Multimodal Utterance-Level Cross-Lingual Speech Representation

Sameer Khurana, Antoine Laurent, James Glass
DOI: https://doi.org/10.1109/jstsp.2022.3192714
IF: 7.695
2022-10-22
IEEE Journal of Selected Topics in Signal Processing
Abstract: We propose SAMU-XLSR, the Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous work on speech representation learning, which learns multilingual contextual speech embeddings at the resolution of an acoustic frame (10–20 ms), this work focuses on learning multimodal (speech-text) multilingual speech embeddings at the resolution of a sentence (5–10 s), such that the embedding vector space is semantically aligned across different languages. We combine the state-of-the-art multilingual acoustic frame-level speech representation learning model XLS-R with the Language Agnostic BERT Sentence Embedding (LaBSE) model to create an utterance-level multimodal multilingual speech encoder, SAMU-XLSR. Although we train SAMU-XLSR with only multilingual transcribed speech data, cross-lingual speech-text and speech-speech associations emerge in its learned representation space. To substantiate our claims, we use the SAMU-XLSR speech encoder in combination with a pre-trained LaBSE text sentence encoder for cross-lingual speech-to-text translation retrieval, and SAMU-XLSR alone for cross-lingual speech-to-speech translation retrieval. We highlight these applications by performing several cross-lingual text and speech translation retrieval tasks across several datasets.
engineering, electrical & electronic
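The translation retrieval described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are synthetic stand-ins for utterance-level speech vectors and sentence-level text vectors, the function names `l2_normalize` and `retrieve` are hypothetical, and scoring by cosine similarity is an assumption made for illustration. The sketch only shows how, once speech and text share one semantically aligned space, retrieval reduces to a nearest-neighbor search.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def retrieve(query_emb, candidate_emb):
    """Return, for each query row, the index of the most similar candidate row.

    Assumption: both matrices live in the same embedding space, one row per
    utterance or sentence.
    """
    sims = l2_normalize(query_emb) @ l2_normalize(candidate_emb).T
    return sims.argmax(axis=1), sims


# Synthetic stand-ins: 5 "text" sentence embeddings of dimension 16.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 16))

# Pretend each "speech" embedding is a slightly noisy copy of its translation's
# text embedding, listed in shuffled order.
true_match = np.array([2, 0, 4, 1, 3])
speech_emb = text_emb[true_match] + 0.01 * rng.normal(size=(5, 16))

best, sims = retrieve(speech_emb, text_emb)
print(best)  # index of the retrieved text sentence for each speech query
```

The same `retrieve` call would cover the speech-to-speech case by passing speech embeddings as both queries and candidates, under the same assumptions.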
What problem does this paper attempt to address?