Abstract: Singing melody extraction from polyphonic musical audio is one of the most challenging tasks in music information retrieval (MIR). Recently, data-driven methods based on convolutional neural networks (CNNs) have achieved great success on this task. Prior work has shown that harmonic relationships are crucial for melody extraction, yet few existing CNN-based singing melody extraction methods take them into account during training. State-of-the-art CNN-based methods cannot capture such long-range harmonic dependencies because of their limited receptive fields and the prohibitive computational cost of enlarging them. In this paper, we introduce a neural harmonic-aware network with gated attentive fusion (NHAN-GAF) for singing melody extraction. Specifically, in the 2-D spectrogram modeling branch, we employ multiple parallel 1-D CNN kernels to capture harmonic relations spanning 1–2 octaves along the frequency axis of the spectrogram. To exploit the advantage of jointly using time-frequency (T-F) domain and time-domain information, we use a two-branch neural network to learn discriminative representations for this task. A novel gated attentive fusion (GAF) network encodes potential correlations between the two branches and fuses the descriptors learned from the raw waveform and the T-F spectrogram. Moreover, the idea behind GAF can be applied to other multimedia applications that involve multimodal analysis. With these two components, the proposed model is able to learn the harmonic relationships in the spectrogram and to better capture contextual yet discriminative features for singing melody extraction. We train the model on part of the vocal tracks of the RWC dataset and on the MIR-1K dataset, and evaluate it on the ADC 2004, MIREX 05 and MedleyDB datasets. The experimental results show that the proposed method outperforms state-of-the-art methods.
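The abstract does not specify the form of the GAF network, so the following is only a minimal sketch of a generic element-wise gated fusion of two branch descriptors, not the authors' implementation. It assumes each branch yields a feature vector of the same length and that a sigmoid gate computed from their concatenation decides how much of each branch to keep. The names `gated_fusion`, `wave_feat`, `spec_feat`, `gate_w` and `gate_b` are hypothetical, and the attentive part of GAF is not modeled here.

```python
import math


def sigmoid(v):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def gated_fusion(wave_feat, spec_feat, gate_w, gate_b):
    """Fuse two equal-length feature vectors with an element-wise gate.

    Hypothetical illustration only: gate_w has one row per output
    dimension and 2*d columns, applied to the concatenation
    [wave_feat; spec_feat]; gate_b holds one bias per output dimension.
    """
    joint = list(wave_feat) + list(spec_feat)
    fused = []
    for i, (w_i, s_i) in enumerate(zip(wave_feat, spec_feat)):
        # Gate pre-activation: linear map of the concatenated descriptors.
        pre = sum(w * x for w, x in zip(gate_w[i], joint)) + gate_b[i]
        g = sigmoid(pre)
        # g near 1 keeps the waveform branch, g near 0 keeps the spectrogram branch.
        fused.append(g * w_i + (1.0 - g) * s_i)
    return fused


# Toy descriptors standing in for the waveform and spectrogram branches.
wave = [1.0, 0.0]
spec = [0.0, 2.0]
zero_w = [[0.0] * 4 for _ in range(2)]

# With zero weights and biases the gate is 0.5, so the result is the average.
print(gated_fusion(wave, spec, zero_w, [0.0, 0.0]))
```

In a trained model the gate weights would be learned jointly with both branches, so the network could weight the waveform or the spectrogram descriptor differently for each feature dimension.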
