Exploiting Phonological Similarities between African Languages to achieve Speech to Speech Translation

Peter Ochieng, Dennis Kaburu
2024-10-30
Abstract: This paper presents a pilot study on direct speech-to-speech translation (S2ST) by leveraging linguistic similarities among selected African languages within the same phylum, particularly in cases where traditional data annotation is expensive or impractical. We propose a segment-based model that maps speech segments both within and across language phyla, effectively eliminating the need for large paired datasets. By utilizing paired segments and guided diffusion, our model enables translation between any two languages in the dataset. We evaluate the model on a proprietary dataset from the Kenya Broadcasting Corporation (KBC), which includes five languages: Swahili, Luo, Kikuyu, Nandi, and English. The model demonstrates competitive performance in segment pairing and translation quality, particularly for languages within the same phylum. Our experiments reveal that segment length significantly influences translation accuracy, with average-length segments yielding the highest pairing quality. Comparative analyses with traditional cascaded ASR-MT techniques show that the proposed model delivers nearly comparable translation performance. This study underscores the potential of exploiting linguistic similarities within language groups to perform efficient S2ST, especially in low-resource language contexts.
Audio and Speech Processing, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses direct speech-to-speech translation (S2ST) between African languages, especially in low-resource settings. It explores four issues:

1. **Limitations of traditional cascaded methods**:
   - Traditional S2ST is usually split into three subtasks: automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech synthesis (TTS). This design suffers from error propagation: errors in one subtask carry into, and can be magnified by, the subsequent subtasks, lowering overall translation quality.
   - Cascaded methods also tend to lose important elements of speech during translation, such as speaker-specific characteristics (indexical components) and the natural rhythm of communication.
2. **Challenges of low-resource languages**:
   - For low-resource languages such as many African languages, the lack of aligned or annotated text data makes text-to-text translation difficult or even impossible.
   - Collecting and annotating speech data is harder than collecting parallel text data, which limits the possibility of fully supervised end-to-end training.
3. **Efficient translation by using language similarity**:
   - The paper proposes a model based on segment mapping that performs direct S2ST across languages by exploiting the phonological similarities between languages within the same phylum. This reduces the dependence on large paired datasets.
   - Specifically, the paper explores how similarities within the same phylum can improve the automatic annotation of speech segments, and it proposes two segment-mapping techniques: position-based mapping and embedding-based mapping.
4. **Evaluating and optimizing translation performance**:
   - The paper evaluates the model on a proprietary dataset from the Kenya Broadcasting Corporation (KBC) covering five languages: Swahili, Luo, Kikuyu, Nandi, and English.
   - The experiments show that segment length significantly affects translation accuracy, with segments of average length giving the best pairing quality. Compared with the traditional cascaded ASR-MT method, the proposed model delivers nearly comparable translation performance.

In summary, the paper attempts to develop a more efficient and accurate direct S2ST method by exploiting the phonological similarities between languages within the same phylum, a method suited especially to low-resource language settings.
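The two mapping techniques named above, position-based and embedding-based, are not specified in detail in this summary, so the following is only a minimal sketch under stated assumptions rather than the authors' method. It assumes that position-based mapping pairs segments by their relative position within two parallel utterances, and that embedding-based mapping pairs each source segment with the target segment whose embedding has the highest cosine similarity. The function names and the toy two-dimensional embeddings are hypothetical.

```python
import math


def position_based_pairs(n_src, n_tgt):
    """Pair each source segment index with the target segment index that
    sits at the same relative position (assumed reading of 'position-based')."""
    if n_src == 0 or n_tgt == 0:
        return []
    pairs = []
    for i in range(n_src):
        # Relative position of the centre of source segment i, in (0, 1).
        rel = (i + 0.5) / n_src
        # Target segment whose span contains that relative position.
        j = min(int(rel * n_tgt), n_tgt - 1)
        pairs.append((i, j))
    return pairs


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_based_pairs(src_emb, tgt_emb):
    """Pair each source segment with the most cosine-similar target segment
    (assumed reading of 'embedding-based'; a greedy nearest-neighbour match)."""
    if not tgt_emb:
        return []
    pairs = []
    for i, u in enumerate(src_emb):
        j = max(range(len(tgt_emb)), key=lambda k: cosine(u, tgt_emb[k]))
        pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    # Four source segments aligned against two target segments by position.
    print(position_based_pairs(4, 2))

    # Toy embeddings: the segment order differs between the two utterances.
    src = [[1.0, 0.0], [0.0, 1.0]]
    tgt = [[0.1, 0.9], [0.9, 0.1]]
    print(embedding_based_pairs(src, tgt))
```

Under these assumptions, the position-based variant needs no learned representation but breaks when word order differs between the two languages, whereas the embedding-based variant can recover reordered segments at the cost of requiring segment embeddings that are comparable across languages.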