Abstract: Singing voice detection is the task of identifying which frames of a recording contain singing vocals. It is one of the main components of music information retrieval (MIR) and is applicable to melody extraction, artist recognition, and music discovery in popular music. Although several methods have been proposed, a more robust and more complete system is still needed to improve detection performance. In this paper, we provide an extensive comparison of the different stages of singing voice detection. Based on this analysis, we propose a novel method for building a more efficient singing voice detection system. The proposed system has three main parts. The first is a singing voice separation pre-process that extracts the vocal from the accompanying music; several singing voice separation methods were compared, in terms of the improvement each brings, to select the best one for integration into the detection system. The second is a deep neural network based classifier that labels the given frames; different deep models for classification were also compared. The third is a post-process that filters out anomalous frames in the classifier's predictions; a median filter and a Hidden Markov Model (HMM) based filter were compared for this step. The methods were compared and analyzed by extending the system module by module. Finally, classification performance on two public datasets indicates that the proposed approach, which is based on the Long-term Recurrent Convolutional Network (LRCN) model, is a promising alternative.
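
To make the post-processing stage concrete, the following is a minimal sketch of median filtering applied to frame-level classifier outputs. It is an illustration under stated assumptions, not the authors' implementation: the function name `median_smooth`, the window length, the 0.5 decision threshold, and the edge-padding strategy are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def median_smooth(frame_probs, kernel_size=5, threshold=0.5):
    """Binarize per-frame vocal probabilities, then median-filter the labels.

    kernel_size and threshold are illustrative assumptions, not values
    taken from the paper.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    # 1 = vocal frame, 0 = non-vocal frame
    labels = (np.asarray(frame_probs) >= threshold).astype(int)
    half = kernel_size // 2
    # Repeat the edge labels so the output has one value per input frame
    padded = np.pad(labels, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel_size)
    # The median of an odd-length binary window is a majority vote
    return np.median(windows, axis=1).astype(int)


# Hypothetical classifier outputs with isolated outlier frames
probs = [0.9, 0.8, 0.1, 0.9, 0.7, 0.2, 0.1, 0.8, 0.1, 0.2]
smoothed = median_smooth(probs, kernel_size=3)
```

With a window of three frames, the isolated non-vocal frame at index 2 and the isolated vocal frame at index 7 are both overwritten by their neighbours. This is the kind of anomaly the post-process is meant to remove. An HMM-based filter would instead model transition probabilities between the vocal and non-vocal states.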
