Abstract: Singing melody extraction from polyphonic musical audio is one of the most challenging tasks in music information retrieval (MIR). Recently, data-driven methods based on convolutional neural networks (CNNs) have achieved great success on this task. In the literature, harmonic relationships have been shown to be crucial for melody extraction, yet few existing CNN-based methods take them into account during training. State-of-the-art CNN-based methods cannot capture such long-range harmonic dependencies because of limited receptive fields and prohibitive computational cost. In this paper, we introduce a neural harmonic-aware network with gated attentive fusion (NHAN-GAF) for singing melody extraction. Specifically, in the 2-D spectrogram modeling branch, we employ multiple parallel 1-D CNN kernels to capture harmonic relations spanning one to two octaves along the frequency axis of the spectrogram. To exploit the advantage of jointly using time-frequency (T-F) domain and time-domain information, we use a two-branch neural network to learn discriminative representations for this task. A novel gated attentive fusion (GAF) network is proposed to encode potential correlations between the two branches and to fuse the descriptors learned from the raw waveform and the T-F spectrogram. Moreover, the idea behind GAF can also be applied to multimedia applications that involve multimodal analysis. With these two components, the proposed model is capable of learning the harmonic relationships in the spectrogram and of better capturing contextual yet discriminative features for singing melody extraction. We train the model on part of the vocal tracks of the RWC and MIR-1K datasets and evaluate it on the ADC 2004, MIREX 05 and MedleyDB datasets. The experimental results show that the proposed method outperforms state-of-the-art methods.
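The fusion of two feature branches described in the abstract can be illustrated with a generic gated fusion step. The sketch below is not the GAF network itself, whose exact formulation is not given here; it only assumes that a sigmoid gate, computed from both branches, decides per element how much of each branch to keep. The names `gated_fusion`, `spec_feat`, `wave_feat`, `weight` and `bias` are hypothetical, and the random arrays stand in for learned features and parameters.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(spec_feat, wave_feat, weight, bias):
    """Blend two (frames, dim) feature maps with a per-element gate.

    Generic gated fusion (an assumption, not the paper's GAF):
    gate = sigmoid([spec; wave] @ weight + bias)
    fused = gate * spec + (1 - gate) * wave
    """
    joint = np.concatenate([spec_feat, wave_feat], axis=-1)  # (frames, 2 * dim)
    gate = sigmoid(joint @ weight + bias)                     # (frames, dim), in (0, 1)
    return gate * spec_feat + (1.0 - gate) * wave_feat


# Placeholder inputs: random features instead of real network activations.
rng = np.random.default_rng(0)
frames, dim = 4, 8
spec_feat = rng.standard_normal((frames, dim))   # stand-in for the spectrogram branch
wave_feat = rng.standard_normal((frames, dim))   # stand-in for the waveform branch
weight = rng.standard_normal((2 * dim, dim)) * 0.1
bias = np.zeros(dim)

fused = gated_fusion(spec_feat, wave_feat, weight, bias)
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the two branch values at that position; with all-zero gate parameters the result reduces to their plain average.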

Improved harmonic spectral envelope extraction for singer classification with hybridised model

Ensemble Model-Based Singer Classification with Proposed Vocal Segmentation

Comparison for Improvements of Singing Voice Detection System Based on Vocal Separation

A neural harmonic-aware network with gated attentive fusion for singing melody extraction

Research On Singing Voice Detection Based On A Long-Term Recurrent Convolutional Network With Vocal Separation And Temporal Smoothing

Neural Vocoder Feature Estimation for Dry Singing Voice Separation

UniSinger: Unified End-to-End Singing Voice Synthesis with Cross-Modality Information Matching

Singer Identification for Metaverse with Timbral and Middle-Level Perceptual Features

Self-Supervised Representations for Singing Voice Conversion

Multi-Band Multi-Resolution Fully Convolutional Neural Networks for Singing Voice Separation

Towards Improving Harmonic Sensitivity and Prediction Stability for Singing Melody Extraction

Singer identification model using data augmentation and enhanced feature conversion with hybrid feature vector and machine learning

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

Singer separation for karaoke content generation

A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction

HANME: Hierarchical Attention Network for Singing Melody Extraction

Analysing Deep Learning-Spectral Envelope Prediction Methods for Singing Synthesis

A Deep-Learning Based Framework for Source Separation, Analysis, and Synthesis of Choral Ensembles

RobustSVC: HuBERT-based Melody Extractor and Adversarial Learning for Robust Singing Voice Conversion

Towards Efficient Automated Singer Identification in Large Music Databases

SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model