Direct Speech-to-Speech Neural Machine Translation: A Survey

Mahendra Gupta, Maitreyee Dutta, Chandresh Kumar Maurya
2024-11-13
Abstract: Speech-to-Speech Translation (S2ST) models transform speech from one language to another target language with the same linguistic information. S2ST is important for bridging the communication gap among communities and has diverse applications. In recent years, researchers have introduced direct S2ST models, which have the potential to translate speech without relying on intermediate text generation, have better decoding latency, and the ability to preserve paralinguistic and non-linguistic features. However, direct S2ST has yet to achieve quality performance for seamless communication and still lags behind the cascade models in terms of performance, especially in real-world translation. To the best of our knowledge, no comprehensive survey is available on the direct S2ST system, which beginners and advanced researchers can look upon for a quick survey. The present work provides a comprehensive review of direct S2ST models, data and application issues, and performance metrics. We critically analyze the models' performance over the benchmark datasets and provide research challenges and future directions.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the key challenges in direct Speech-to-Speech Translation (S2ST): converting speech in one language directly into speech in another language without generating an intermediate text representation. Compared with the traditional cascaded approach, which chains Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech synthesis (TTS), direct S2ST can offer lower decoding latency and is better able to preserve paralinguistic features such as intonation, as well as non-linguistic features. However, direct models still lag behind cascaded models in performance, especially in real-world translation tasks.

The paper's main contribution is a comprehensive review of research on direct S2ST models, covering data and application issues, performance evaluation metrics, and an analysis of how existing models perform on benchmark datasets. It also identifies research challenges and future directions. The main challenges are:

1. **Scarcity of parallel speech corpora**: Large amounts of parallel speech data between two languages are very difficult to obtain, which limits model training and performance gains.
2. **Processing of unwritten languages**: For languages without a writing system, text-based model training is not feasible, so direct S2ST models need to handle such languages.
3. **Security threats of voice cloning**: Direct S2ST models may be used to clone an individual's voice, raising privacy and security concerns.
4. **Lack of direct quality evaluation metrics**: Existing quality evaluation methods rely mainly on text, whereas direct S2ST calls for metrics that compare the generated speech with the reference speech directly.
5. **Segmentation problems**: In simultaneous S2ST in particular, the model sees only partial input at each step, which makes calculating the average latency a major challenge.

By addressing these problems, the paper aims to advance direct S2ST technology toward the quality required for seamless communication.
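
To make the latency issue in the segmentation challenge more concrete, the sketch below computes Average Lagging, a latency measure from the simultaneous text translation literature. The summary above does not name a specific metric, so the choice of Average Lagging is an assumption, as are the function name `average_lagging` and the token-level `delays` input. For speech, the source and target would first have to be segmented into comparable units (for example, time spans), which is where the segmentation difficulty arises.

```python
from typing import Sequence


def average_lagging(delays: Sequence[int], source_len: int) -> float:
    """Average Lagging for one utterance (illustrative, token-level).

    delays[i] is the number of source units that had been read when the
    (i+1)-th target unit was emitted. source_len is the total number of
    source units. Both are assumed to come from an upstream segmenter.
    """
    if not delays or source_len <= 0:
        raise ValueError("delays must be non-empty and source_len positive")

    target_len = len(delays)
    gamma = target_len / source_len  # target-to-source length ratio

    total = 0.0
    for i, read_so_far in enumerate(delays):
        # Lag of this target unit relative to an ideal policy that
        # advances through the source at a steady rate of 1/gamma.
        total += read_so_far - i / gamma
        if read_so_far >= source_len:
            # Stop at the first target unit emitted after the full
            # source has been read; later units add no information.
            return total / (i + 1)

    # Choice made for this sketch: if the full source was never read,
    # average over all target units.
    return total / target_len


# A wait-1 policy stays one unit behind; an offline system waits for everything.
print(average_lagging([1, 2, 3, 4], source_len=4))
print(average_lagging([4, 4, 4, 4], source_len=4))
```

Under this definition, a system that waits for the entire input before producing output gets a lag equal to the source length, while a system that emits output in step with the input gets a small constant lag.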