Abstract: Speech-to-speech translation is commonly tackled with a three-stage cascade system comprising speech recognition, machine translation, and speech synthesis. However, this approach suffers from error accumulation across stages. In contrast, a direct speech-to-speech translation model converts source-language speech into target-language speech without generating intermediate text, thereby avoiding the error propagation of cascaded systems. Direct speech-to-speech translation methods currently fall into two categories. The first maps the Mel-spectrogram of the source-language speech to the Mel-spectrogram of the target-language speech. However, this method often struggles to converge and to produce audible target-language speech. The second learns a self-supervised discrete representation of the target language from an unlabeled speech corpus. A sequence-to-sequence model is then trained on a real-world dataset to map source-language speech to the discrete representation of the target language. Finally, a separately trained vocoder converts the discrete unit sequence into a speech waveform. Given the limited availability of large-scale Tibetan-Chinese parallel speech corpora, this work adopts the second method to model Tibetan-Chinese speech-to-speech translation. Additionally, a multi-task learning framework is designed to enhance the performance of the speech translation model. Experimental results demonstrate that the Tibetan-Chinese speech-to-speech translation model based on multi-task self-supervised learning achieves a higher BLEU score than both the spectrogram-mapping model and the single-task self-supervised learning model.
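
To make the notion of a "discrete unit sequence" concrete, the following is a minimal sketch of one common way to obtain such units: assigning each frame-level feature vector to its nearest codebook centroid and then collapsing consecutive repeats. The abstract does not specify how the discrete representation is produced, so the nearest-centroid quantization, the repeat collapsing, and the names `quantize_frames` and `collapse_repeats` are illustrative assumptions, not the authors' method.

```python
from typing import List, Sequence


def quantize_frames(
    frames: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
) -> List[int]:
    """Map each frame to the index of its nearest centroid (squared Euclidean).

    Assumption: the centroids come from some clustering of self-supervised
    speech features; the source does not state how they are obtained.
    """
    units = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(distances.index(min(distances)))
    return units


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical consecutive unit IDs into a single unit."""
    collapsed: List[int] = []
    for unit in units:
        if not collapsed or collapsed[-1] != unit:
            collapsed.append(unit)
    return collapsed


# Toy data: two hypothetical centroids and five 2-dimensional frames.
toy_centroids = [[0.0, 0.0], [1.0, 1.0]]
toy_frames = [[0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.1, 0.9], [0.0, 0.1]]

frame_units = quantize_frames(toy_frames, toy_centroids)
target_units = collapse_repeats(frame_units)
```

In a pipeline of the second type, a sequence like `target_units` would serve as the prediction target of the sequence-to-sequence model, and a separately trained vocoder would turn predicted units back into a waveform.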

A Self-Supervised Model for Language Identification Integrating Phonological Knowledge

PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification

Explore the Use of Self-supervised Pre-trained Acoustic Features on Disguised Speech Detection

End-to-end Oriental Language Speech Recognition with Integrated Language Identification

Phonetic Temporal Neural Model for Language Identification

Conformer-based Language Embedding with Self-Knowledge Distillation for Spoken Language Identification

Two-stage Training for Chinese Dialect Recognition

Deep temporal representation learning for language identification

Phone-Aware Multi-task Learning and Length Expanding for Short-Duration Language Recognition.

CNN-Based End-To-End Language Identification

Joint unsupervised and supervised learning for context-aware language identification

Language Identification Based on Convolutional Neural Network

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

Transducer-based language embedding for spoken language identification

Temporal feature extraction based on CNN-BLSTM and temporal pooling for language identification

Multi-Task Self-Supervised Learning Based Tibetan-Chinese Speech-to-Speech Translation.

Deep joint learning for language recognition

Enhance Language Identification using Dual-mode Model with Knowledge Distillation

Insights into End-to-End Learning Scheme for Language Identification

LID-senone Extraction Via Deep Neural Networks for End-to-End Language Identification

An Improved LSTM for Language Identification