Abstract: Most of the speech translation models heavily rely on parallel data, which is hard to collect especially for low-resource languages. To tackle this issue, we propose to build a cascaded speech translation system without leveraging any kind of paired data. We use fully unpaired data to train our unsupervised systems and evaluate our results on CoVoST 2 and CVSS. The results show that our work is comparable with some other early supervised methods in some language pairs. While cascaded systems always suffer from severe error propagation problems, we proposed denoising back-translation (DBT), a novel approach to building robust unsupervised neural machine translation (UNMT). DBT successfully increases the BLEU score by 0.7 to 0.9 in all three translation directions. Moreover, we simplified the pipeline of our cascaded system to reduce inference latency and conducted a comprehensive analysis of every part of our work. We also demonstrate our unsupervised speech translation results on the established website.

What problem does this paper attempt to address?

This paper sets out to build a cascaded unsupervised speech-to-speech translation (US2ST) system without any paired data, in order to work around the scarcity of parallel data for low-resource languages. The authors propose a new method, denoising back-translation (DBT), that makes the unsupervised neural machine translation (UNMT) component more robust and so reduces error propagation through the cascade. To reduce inference latency, they also simplify the pipeline: the output of unsupervised ASR (UASR) is normalized, a text decoder reconstructs the unnormalized text, and the result is fed into UNMT.

### Background and Problem Description of the Paper

The goal of speech translation (ST) is to convert speech in one language into text or speech in another language, so that speakers of different languages can communicate. A traditional speech-to-text translation (S2TT) system chains an automatic speech recognition (ASR) module and a text-to-text machine translation (MT) module. A cascaded speech-to-speech translation (S2ST) system adds a text-to-speech (TTS) module after them. Direct S2TT and S2ST systems have made progress in reducing error propagation and inference latency, but most ST systems still rely on parallel data, which is very limited for low-resource languages and severely restricts their performance.

### Proposed Methods

1. **Denoising Back-Translation (DBT)**
   - **Purpose**: Improve the robustness of the UNMT system and reduce error propagation in the cascaded system.
   - **Method**: DBT combines the ideas of denoising auto-encoding and back-translation. A clean sentence is first corrupted by a noise function, and the noisy sentence is translated to produce a pseudo-sentence in the other language. The reverse-direction model is then trained to recover the original clean sentence from that noisy pseudo-sentence, which makes it more robust to noisy input.
   - **Formula**:

     \[ L_{\text{DBT}} = E_{x \in S}[-\log P_{T \to S}(x | u^*(f(x)))] + E_{y \in T}[-\log P_{S \to T}(y | v^*(f(y)))] \]

     where \(x\) and \(y\) are sentences in the source language \(S\) and the target language \(T\) respectively, \(f(\cdot)\) is a noise function, and \(u^*\) and \(v^*\) are translation functions. Because \(u^*(f(x))\) is the input to \(P_{T \to S}\), \(u^*\) translates from source to target; likewise \(v^*\) translates from target to source.
2. **System Simplification**
   - **Purpose**: Reduce the inference latency of the cascaded system.
   - **Method**: Normalize the output of UASR (remove punctuation marks and convert to lowercase), then use a text decoder to reconstruct the unnormalized text and feed it into UNMT. This shortens the pipeline and reduces inference time.

### Experimental Results

The researchers evaluated their system on multilingual S2ST and S2TT datasets, namely CVSS and CoVoST 2. The results show that the US2ST system performs reasonably well in multiple translation directions and is comparable with some earlier supervised methods on some language pairs. DBT improves the BLEU score by 0.7 to 0.9 points in all three translation directions.

### Conclusion

This paper builds a cascaded unsupervised speech translation system without any paired data and improves the robustness and performance of the system by introducing DBT. Simplifying the pipeline also reduces inference latency, making the system more practical. These results offer a new approach for speech translation research in low-resource languages.
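The two proposed methods can be illustrated with a minimal Python sketch. It covers only data preparation, not model training. Everything named here is an assumption for illustration: the summary does not say which noise function \(f(\cdot)\) the paper uses, so `add_noise` substitutes a common choice (random word dropping plus a bounded local shuffle), and `translate` is a hypothetical stand-in for the frozen translation function \(u^*\). `normalize` follows the description of UASR output normalization (lowercase, no punctuation).

```python
import random
import string
from typing import Callable, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and strip ASCII punctuation, mirroring the described UASR normalization."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def add_noise(tokens: List[str], rng: random.Random,
              drop_prob: float = 0.1, max_shift: float = 3.0) -> List[str]:
    """Hypothetical noise function f: drop words at random, then shuffle locally.

    This is an assumed noise model, not necessarily the one used in the paper.
    """
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept and tokens:
        kept = [rng.choice(tokens)]  # never return an empty sentence
    # Sort by (position + random jitter); a small jitter keeps moves local.
    keys = [i + rng.uniform(0, max_shift) for i in range(len(kept))]
    return [tok for _, tok in sorted(zip(keys, kept), key=lambda p: p[0])]


def build_dbt_pairs(clean_sentences: List[str],
                    translate: Callable[[str], str],
                    rng: random.Random) -> List[Tuple[str, str]]:
    """Build (input, target) pairs for one term of the DBT loss.

    For each clean sentence x, the input is u*(f(x)) and the target is x,
    so the reverse model learns P(x | u*(f(x))).
    """
    pairs = []
    for x in clean_sentences:
        noisy = " ".join(add_noise(x.split(), rng))  # f(x)
        pseudo = translate(noisy)                    # u*(f(x)), from the frozen model
        pairs.append((pseudo, x))                    # train to recover the clean x
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    corpus = [normalize("Hello, World! This is a test.")]
    # str.upper is a placeholder for a real translation model.
    for src, tgt in build_dbt_pairs(corpus, str.upper, rng):
        print(repr(src), "->", repr(tgt))
```

The second term of the loss is built the same way with the roles swapped: clean target-language sentences \(y\), the reverse translation function \(v^*\), and training pairs \((v^*(f(y)), y)\).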

Improving Cascaded Unsupervised Speech Translation with Denoising Back-translation

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

Almost Unsupervised Text to Speech and Automatic Speech Recognition

Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining

Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data

Tight Integrated End-to-End Training for Cascaded Speech Translation

Bidirectional Boost: On Improving Tibetan-Chinese Neural Machine Translation With Back-Translation and Self-Learning

Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

DUB: Discrete Unit Back-translation for Speech Translation

Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Translatotron 3: Speech to Speech Translation with Monolingual Data

Back Translation for Speech-to-text Translation Without Transcripts

Towards a Deep Understanding of Multilingual End-to-End Speech Translation

Towards End-to-end Speech-to-text Translation with Two-pass Decoding

Multi-Task Self-Supervised Learning Based Tibetan-Chinese Speech-to-Speech Translation

End-to-End Speech Translation with Adversarial Training