TranSentence: Speech-to-speech Translation via Language-agnostic Sentence-level Speech Encoding without Language-parallel Data

Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee
DOI: https://doi.org/10.1109/ICASSP48485.2024.10447331
2024-01-17
Abstract: Although there has been significant advancement in the field of speech-to-speech translation, conventional models still require language-parallel speech data between the source and target languages for training. In this paper, we introduce TranSentence, a novel speech-to-speech translation method that requires no language-parallel speech data. To achieve this, we first adopt a language-agnostic sentence-level speech encoding that captures the semantic information of speech, irrespective of language. We then train our model to generate speech based on the encoded embedding obtained from a language-agnostic sentence-level speech encoder that is pre-trained with various languages. With this method, despite training exclusively on the target language's monolingual data, we can generate target language speech in the inference stage using language-agnostic speech embedding from the source language speech. Furthermore, we extend TranSentence to multilingual speech-to-speech translation. The experimental results demonstrate that TranSentence is superior to other models.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper proposes TranSentence, a speech-to-speech translation method that does not require language-parallel speech data. Conventional speech-to-speech translation systems are trained on parallel speech between the source and target languages. TranSentence instead relies on a language-agnostic sentence-level speech encoding that captures the semantic information of an utterance irrespective of its language. A sentence-level speech encoder pre-trained on various languages first encodes target-language speech, and the model is then trained to generate speech from these embeddings, using only monolingual target-language data. At inference, the embedding of the source-language speech is passed to the same model to generate target-language speech. The method is also extended to multilingual speech-to-speech translation. The reported experiments show TranSentence outperforming other models, which supports its ability to translate speech without language-parallel data. Additionally, the researchers propose a modeling method for generating speech from speech embeddings through feature extension.
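The train-on-target, infer-from-source idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the lookup-table encoder, the hand-picked vectors, the nearest-neighbour `ToyDecoder`, and the text strings standing in for speech are all hypothetical, chosen only to show why a shared embedding space removes the need for parallel data.

```python
import math

# Hypothetical stand-in for a pre-trained language-agnostic sentence encoder:
# utterances with the same meaning get nearly the same vector in any language.
# The strings stand in for speech; the vectors are made up for illustration.
TOY_ENCODER = {
    "hello (en)": [1.0, 0.0, 0.1],
    "hola (es)": [0.98, 0.05, 0.1],
    "thank you (en)": [0.0, 1.0, 0.2],
    "gracias (es)": [0.03, 0.97, 0.2],
}


def encode(utterance):
    """Return the sentence-level embedding of an utterance (toy lookup)."""
    return TOY_ENCODER[utterance]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyDecoder:
    """Stand-in for the speech generator: maps an embedding to an utterance."""

    def __init__(self):
        self.memory = []

    def train(self, target_utterances):
        # Training sees only monolingual target-language data:
        # each utterance is paired with its own embedding.
        self.memory = [(encode(u), u) for u in target_utterances]

    def generate(self, embedding):
        # Return the training utterance whose embedding is closest.
        best = max(self.memory, key=lambda pair: cosine(pair[0], embedding))
        return best[1]


def translate(decoder, source_utterance):
    # Inference: encode source-language speech, decode in the target language.
    return decoder.generate(encode(source_utterance))


decoder = ToyDecoder()
decoder.train(["hello (en)", "thank you (en)"])  # English-only training data
print(translate(decoder, "hola (es)"))
print(translate(decoder, "gracias (es)"))
```

The decoder never sees a Spanish example during training, yet Spanish inputs are mapped to English outputs because both languages share one embedding space. In the actual system the decoder would be a trained speech generator rather than a nearest-neighbour lookup.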