TranSentence: Speech-to-speech Translation via Language-agnostic Sentence-level Speech Encoding without Language-parallel Data

Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee

DOI: https://doi.org/10.1109/ICASSP48485.2024.10447331

2024-01-17

Abstract: Although there has been significant advancement in the field of speech-to-speech translation, conventional models still require language-parallel speech data between the source and target languages for training. In this paper, we introduce TranSentence, a novel speech-to-speech translation model that requires no language-parallel speech data. To achieve this, we first adopt a language-agnostic sentence-level speech encoding that captures the semantic information of speech, irrespective of language. We then train our model to generate speech based on the encoded embedding obtained from a language-agnostic sentence-level speech encoder that is pre-trained with various languages. With this method, despite training exclusively on the target language's monolingual data, we can generate target language speech in the inference stage using the language-agnostic speech embedding from the source language speech. Furthermore, we extend TranSentence to multilingual speech-to-speech translation. The experimental results demonstrate that TranSentence is superior to other models.

Computation and Language, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper proposes TranSentence, a speech-to-speech translation method that does not need language-parallel speech data. Conventional speech-to-speech translation systems are trained on parallel speech data that pairs the source language with the target language. TranSentence instead relies on a language-agnostic sentence-level speech encoding that captures the semantic information of speech regardless of the language spoken. The method first uses a pre-trained language-agnostic sentence-level speech encoder to encode target-language speech, and then trains the model to generate speech from these embeddings. Because the embedding is language-agnostic, at inference the embedding of source-language speech can be fed to the same model to generate target-language speech, even though the model was trained only on the target language's monolingual data. TranSentence is also extended to multilingual speech-to-speech translation. The experimental results show that TranSentence outperforms other models, demonstrating that speech-to-speech translation is possible without language-parallel speech data. Additionally, the researchers propose a modeling method for generating speech from speech embeddings through feature extension.
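The training and inference flow described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and none of it is taken from the paper: text strings stand in for speech, a lookup table (`MEANINGS`, `LEXICON`, `encode`) stands in for the pre-trained language-agnostic encoder, and a nearest-neighbour lookup (`TargetDecoder`) stands in for the speech generator. The only point it tries to show is that a decoder fitted on target-language data alone can still be driven by a source-language embedding when both languages share one embedding space.

```python
import random

# Hypothetical shared semantic space: one vector per meaning, not per language.
MEANINGS = {
    "greeting": [1.0, 0.0, 0.0],
    "thanks": [0.0, 1.0, 0.0],
    "farewell": [0.0, 0.0, 1.0],
}

# Hypothetical lexicon tying each (language, utterance) pair to a meaning.
LEXICON = {
    ("es", "hola"): "greeting",
    ("es", "gracias"): "thanks",
    ("es", "hasta luego"): "farewell",
    ("en", "hello"): "greeting",
    ("en", "thank you"): "thanks",
    ("en", "goodbye"): "farewell",
}


def encode(lang, utterance, rng):
    """Toy language-agnostic encoder: same meaning -> nearly the same vector."""
    base = MEANINGS[LEXICON[(lang, utterance)]]
    # Small noise mimics the fact that embeddings never match exactly.
    return [x + rng.uniform(-0.05, 0.05) for x in base]


class TargetDecoder:
    """Toy generator: returns the stored target utterance nearest to an embedding."""

    def __init__(self):
        self.memory = []

    def fit(self, pairs):
        # pairs: iterable of (embedding, target_utterance), target language only.
        self.memory = list(pairs)

    def generate(self, embedding):
        def sq_dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        return min(self.memory, key=lambda p: sq_dist(p[0], embedding))[1]


rng = random.Random(0)

# Training: only monolingual target-language (English) data is used.
target_data = ["hello", "thank you", "goodbye"]
decoder = TargetDecoder()
decoder.fit((encode("en", u, rng), u) for u in target_data)


def translate(source_lang, utterance):
    """Inference: embed the source utterance, then decode into the target language."""
    return decoder.generate(encode(source_lang, utterance, rng))


print(translate("es", "gracias"))  # prints "thank you"
```

In this sketch no Spanish example ever reaches `decoder.fit`; translation works only because `encode` places Spanish and English utterances with the same meaning close together, which is the property the summary attributes to the language-agnostic encoder.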

SimulS2S: End-to-End Simultaneous Speech to Speech Translation

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

Pre-training for Speech Translation: CTC Meets Optimal Transport

Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction

A Bilingual Generative Transformer for Semantic Sentence Embedding

TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation

Translatotron 2: High-quality direct speech-to-speech translation with voice preservation

Improving Speech Translation by Understanding the Speech From Latent Code

Translatotron 3: Speech to Speech Translation with Monolingual Data

Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

AlloST: Low-resource Speech Translation without Source Transcription

Data Efficient Direct Speech-to-Text Translation with Modality Agnostic Meta-Learning

Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens

Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation

Textless Speech-to-Speech Translation With Limited Parallel Data

TransAug: Translate as Augmentation for Sentence Embeddings