Abstract: Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics, while automatic music transcription (AMT) refers to transcribing singing voices into note events, i.e., musical MIDI notes. Although both tasks have significant potential for practical application, they remain nascent. Transcribing lyrics and note events from singing audio alone is notoriously difficult because noise contamination, e.g., musical accompaniment, degrades both the intelligibility of sung lyrics and the recognizability of sung notes. To address this challenge, we propose a general framework for implementing multimodal ALT and AMT systems. Additionally, we curate the first multimodal singing dataset, comprising N20EMv1 and N20EMv2, which encompasses audio recordings and videos of lip movements, together with ground truth for lyrics and note events. For model construction, we propose adapting self-supervised learning models from the speech domain as acoustic encoders and visual encoders to alleviate the scarcity of labeled data. We also introduce a residual cross-attention mechanism to effectively integrate features from the audio and video modalities. Through extensive experiments, we demonstrate that our single-modal systems exhibit state-of-the-art performance on both ALT and AMT tasks. Through these single-modal experiments, we also explore the individual contribution of each modality to the multimodal system. Finally, we combine the modalities and demonstrate the effectiveness of our proposed multimodal systems, particularly in terms of their noise robustness.

Transfer Learning of wav2vec 2.0 for Automatic Lyric Transcription

PDAugment: Data Augmentation by Pitch and Duration Adjustments for Automatic Lyrics Transcription

Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing

AudioVSR: Enhancing Video Speech Recognition with Audio Data

End-to-end lyrics Recognition with Voice to Singing Style Transfer

MM-ALT: A Multimodal Automatic Lyric Transcription System

Deep Audio-Visual Singing Voice Transcription based on Self-Supervised Learning Models

Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model

Transfer Learning Using Musical Instrument Audio for Improving Automatic Singing Label Calibration

SongTrans: An unified song transcription and alignment method for lyrics and notes

Transfer Learning for Improving Singing-voice Detection in Polyphonic Instrumental Music

Self-Supervised Representations for Singing Voice Conversion

VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

Learning Singing From Speech

LOAF-M2L: Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation

Transfer Learning Methods for Spoken Language Understanding

Improving Automatic Speech Recognition for Non-Native English with Transfer Learning and Language Model Decoding

Speech Technology for Everyone: Automatic Speech Recognition for Non-Native English with Transfer Learning

Songs Across Borders: Singable and Controllable Neural Lyric Translation

Adapting pretrained speech model for Mandarin lyrics transcription and alignment

Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion