A Scalable Sparse Transformer Model for Singing Melody Extraction.

Shuai Yu, Jun Liu, Yi Yu, Wei Li
DOI: https://doi.org/10.1109/ICASSP48485.2024.10447953
2024-01-01
Abstract: Extracting the melody of a singing voice is an essential task in music information retrieval (MIR). Recently, transformer-based models have drawn great attention in the field of MIR. However, because of their high computation cost and large number of parameters, it is difficult to train and deploy a transformer-based model for practical singing melody extraction. In this paper, we propose a simple yet effective scalable sparse transformer for singing melody extraction. Specifically, we first employ a sparse transformer to reduce the computation cost and the number of parameters. Then, we scale the self-attention region of the sparse transformer in the spectrogram to improve accuracy. Moreover, we combine the scalable sparse transformer (S²Former) with a CNN-based model to extract global and local features from the spectrogram. The proposed scalable transformer model achieves a better balance between a standard transformer and a sparse transformer. To better fuse the features from the transformer and the CNN, we further propose a transformer-CNN fusion (TCF) module that combines the significant features from both. The proposed model obtains state-of-the-art results on several public datasets, and the experiments confirm its effectiveness.
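The abstract does not say how the sparse attention region is defined or scaled. As a rough illustration only, the sketch below assumes a one-dimensional sequence of spectrogram frames in which each frame attends to neighbors within a hypothetical `radius`. The function names and the banded pattern are assumptions made for this example and are not taken from the paper. Widening the radius moves the pattern from sparse attention toward full attention, which is the kind of trade-off the abstract describes.

```python
def local_attention_mask(num_frames, radius):
    """Build a banded mask: mask[i][j] is True when frame i may attend to frame j.

    Assumption for illustration: the attention region is a symmetric band of
    width 2 * radius + 1 around each frame. The paper may define it differently.
    """
    return [
        [abs(i - j) <= radius for j in range(num_frames)]
        for i in range(num_frames)
    ]


def attended_pairs(mask):
    """Count the query-key pairs that are kept, a proxy for attention cost."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    num_frames = 8
    # A small radius gives sparse attention; radius = num_frames - 1 gives full attention.
    for radius in (1, 3, num_frames - 1):
        mask = local_attention_mask(num_frames, radius)
        print(radius, attended_pairs(mask), "of", num_frames * num_frames)
```

With 8 frames, a radius of 1 keeps 22 of the 64 query-key pairs, a radius of 3 keeps 44, and a radius of 7 keeps all 64, so the radius acts as a single knob between the sparse and standard cases in this toy setting.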