Abstract: Song translation requires both translation of lyrics and alignment of music notes so that the resulting verse can be sung to the accompanying melody, which is a challenging problem that has attracted some interest in different aspects of the translation process. In this paper, we propose Lyrics-Melody Translation with Adaptive Grouping (LTAG), a holistic solution to automatic song translation by jointly modeling lyrics translation and lyrics-melody alignment. It is a novel encoder-decoder framework that can simultaneously translate the source lyrics and determine the number of aligned notes at each decoding step through an adaptive note grouping module. To address data scarcity, we commissioned a small amount of training data annotated specifically for this task and used large amounts of augmented data through back-translation. Experiments conducted on an English-Chinese song translation data set show the effectiveness of our model in both automatic and human evaluation.

What problem does this paper attempt to address?

The paper addresses Automatic Song Translation (AST): translating lyrics so that the translated lyrics can still be aligned with the original melody, preserving the aesthetic integrity of the song. It proposes a framework named Lyrics-Melody Translation with Adaptive Grouping (LTAG), which handles lyrics translation and lyrics-melody alignment jointly rather than as separate steps, as existing methods do. LTAG introduces an adaptive grouping module that dynamically predicts the number of aligned notes at each decoding step, with the aim of producing both high-quality translations and reasonable lyrics-melody alignments.

### Main Contributions

1. **Proposed the first framework for jointly translating lyrics and aligning lyrics with melody**: LTAG models lyrics translation and lyrics-melody alignment simultaneously within a Transformer encoder-decoder framework.
2. **Designed an adaptive grouping method**: This method produces high-quality lyrics translation and flexible, reasonable lyrics-melody alignment during autoregressive decoding.
3. **Generated the first bilingual lyrics-melody alignment dataset**: This dataset will be publicly released to promote further research in this field. In addition, data scarcity is addressed through back-translation and a curriculum learning strategy.
4. **Experimentally verified the effectiveness of LTAG**: The experimental results show that LTAG significantly outperforms the baseline systems in both automatic metrics and human evaluations. The translated lyrics are faithful to the original text and can be aligned with the melody, with high singability and overall quality.
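
As a rough illustration of what "predicting the number of aligned notes at each decoding step" could look like, the sketch below accumulates per-note halting probabilities until a threshold is crossed, in the spirit of Adaptive Computation Time. The function name `group_notes`, the `halting_probs` input, and the `threshold` value are assumptions made for illustration. In LTAG such scores would come from a learned module, and this greedy loop is not the authors' implementation.

```python
from typing import List


def group_notes(halting_probs: List[float], num_tokens: int,
                threshold: float = 0.99) -> List[int]:
    """Greedily assign a note count to each target token.

    halting_probs[i] is a hypothetical score (assumed to come from a
    learned model) that the current token's note group ends at note i.
    """
    n = len(halting_probs)
    if num_tokens <= 0 or n < num_tokens:
        raise ValueError("need at least one note per target token")

    counts: List[int] = []
    i = 0  # index of the next unassigned note
    for t in range(num_tokens):
        remaining_tokens = num_tokens - t - 1
        if remaining_tokens == 0:
            # The last token absorbs every note that is left.
            counts.append(n - i)
            break
        total = 0.0
        used = 0
        # Reserve at least one note for each token still to come.
        while i < n - remaining_tokens:
            total += halting_probs[i]
            i += 1
            used += 1
            if total >= threshold:
                break
        counts.append(used)
    return counts


if __name__ == "__main__":
    # Six notes shared among four target tokens.
    print(group_notes([1.0, 0.3, 0.8, 1.0, 0.2, 0.5], num_tokens=4))
```

The returned counts always sum to the number of notes, so every note is sung by exactly one target token. A trained model would be expected to learn the scores so that these groupings match natural melismas.
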
### Method Overview

- **Overall Architecture**: LTAG adopts an autoregressive translation architecture consisting of a Transformer-based encoder-decoder module, two embedding layers for processing note and alignment information, and an alignment decoder.
- **Note Pooling Embedding**: This module encodes the note sequence and alignment information into embedding vectors, producing melody embeddings and alignment embeddings through non-overlapping mean pooling.
- **Alignment Decoder**: Inspired by Adaptive Computation Time (ACT), the alignment decoder uses an adaptive grouping module to dynamically predict the number of notes corresponding to each target word.
- **Back-Translation and Alignment**: To alleviate data scarcity, the paper uses back-translation to generate more training data and gradually adjusts the data sampling ratio through a curriculum learning strategy.

### Experimental Results

- **Translation Evaluation**: In both the Chinese-to-English and English-to-Chinese lyrics translation tasks, LTAG significantly outperforms the baseline systems on the human evaluation metric for translation (MOS-T).
- **Lyrics-Melody Alignment Evaluation**: LTAG significantly outperforms the other systems on the human evaluation metric for alignment quality (MOS-S), especially in how reasonable the alignments are.
- **Comprehensive Evaluation**: Taking translation quality and alignment quality together, LTAG also scores best on the overall quality evaluation (MOS-Q).

### Conclusion

The LTAG framework proposed in the paper shows significant advantages in automatic song translation. It outperforms the baseline systems in both translation quality and lyrics-melody alignment, and it suggests a direction for further research in this field.
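
The non-overlapping mean pooling mentioned under Note Pooling Embedding can be pictured with a small sketch: consecutive note vectors that belong to the same word are averaged into one vector, so the melody sequence ends up with one entry per word. The name `pool_notes`, the plain-list representation, and the toy two-dimensional vectors are assumptions for illustration only; the paper's module operates on learned embeddings inside the model.

```python
from typing import List, Sequence


def pool_notes(note_vectors: Sequence[Sequence[float]],
               group_sizes: Sequence[int]) -> List[List[float]]:
    """Average consecutive note vectors into one vector per group.

    group_sizes[k] is the number of notes aligned to the k-th word;
    groups do not overlap and together cover every note.
    """
    if sum(group_sizes) != len(note_vectors):
        raise ValueError("group sizes must cover every note exactly once")

    pooled: List[List[float]] = []
    start = 0
    for size in group_sizes:
        if size <= 0:
            raise ValueError("each group needs at least one note")
        group = note_vectors[start:start + size]
        dim = len(group[0])
        # Element-wise mean over the notes in this group.
        pooled.append([sum(vec[d] for vec in group) / size
                       for d in range(dim)])
        start += size
    return pooled


if __name__ == "__main__":
    notes = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
    # First word is sung over two notes, second word over one.
    print(pool_notes(notes, [2, 1]))
```

Because the groups are disjoint, the pooled sequence has the same length as the word sequence, which is what lets melody information be combined position by position with the lyrics.
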

Translate the Beauty in Songs: Jointly Learning to Align Melody and Translate Lyrics

Songs Across Borders: Singable and Controllable Neural Lyric Translation

SongTrans: An unified song transcription and alignment method for lyrics and notes

A Computational Evaluation Framework for Singable Lyric Translation

Automatic Song Translation for Tonal Languages

SongMASS: Automatic Song Writing with Pre-training and Alignment Constraint

Unsupervised Melody-Guided Lyrics Generation

SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation

Sing it, Narrate it: Quality Musical Lyrics Translation

Improving Lyrics Alignment Through Joint Pitch Detection

Deep Attention-Based Alignment Network for Melody Generation from Incomplete Lyrics

Unsupervised Melody-to-Lyric Generation

Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model

Neural Melody Composition from Lyrics

Automatic Lyrics Transcription of Polyphonic Music With Lyrics-Chord Multi-Task Learning

Modeling the Rhythm from Lyrics for Melody Generation of Pop Song

ReLyMe: Improving Lyric-to-Melody Generation by Incorporating Lyric-Melody Relationships

LOAF-M2L: Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation

Accompanied Singing Voice Synthesis with Fully Text-controlled Melody

Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment

Melody Generation from Lyrics with Local Interpretability