Abstract: Extracting the melody of a singing voice is an essential task in music information retrieval (MIR). Recently, transformer-based models have drawn great attention in the field of MIR. However, because of their high computational cost and large number of parameters, transformer-based models are difficult to train and deploy for practical singing melody extraction. In this paper, we propose a simple yet effective scalable sparse transformer for singing melody extraction. Specifically, we first employ a sparse transformer to reduce the computational cost and the number of parameters. We then scale the self-attention region of the sparse transformer in the spectrogram to improve accuracy. Moreover, we combine the scalable sparse transformer (S²Former) with a CNN-based model to extract both global and local features from the spectrogram. The proposed scalable transformer achieves a better balance between a standard transformer and a sparse transformer. To better fuse the features from the transformer and the CNN, we further propose a transformer-CNN fusion (TCF) module that combines the significant features from both. The proposed model obtains state-of-the-art results on several public datasets, and the experiments confirm its effectiveness.
