Abstract: Singing voice detection is the task of identifying which frames of a recording contain singing voice. It is one of the main components of music information retrieval (MIR), with applications to melody extraction, artist recognition, and music discovery in popular music. Although several methods have been proposed, a more robust and more complete system is still needed to improve detection performance. In this paper, our motivation is to provide an extensive comparison of the different stages of singing voice detection. Based on this analysis, we propose a novel method for building a more efficient singing voice detection system. The proposed system has three main parts. The first is a singing voice separation pre-processing step that extracts the vocals from the accompaniment. Several singing voice separation methods were compared by the improvement they bring, and the best one was integrated into the detection system. The second is a deep neural network based classifier that labels each frame; different deep models for classification were also compared. The last is a post-processing step that filters out anomalous frames in the classifier's predictions; a median filter and a Hidden Markov Model (HMM) based filter were compared for this step. The methods were compared and analyzed by extending the system module by module. Finally, classification performance on two public datasets indicates that the proposed approach, which is based on the Long-term Recurrent Convolutional Network (LRCN) model, is a promising alternative.
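
To illustrate the post-processing stage mentioned in the abstract, the following is a minimal sketch of median-filter smoothing applied to frame-wise classifier outputs. It is not the authors' implementation: the function name `median_smooth`, the 0.5 decision threshold, the kernel size, and the edge-padding strategy are all assumptions made for illustration only.

```python
import numpy as np


def median_smooth(probs, kernel_size=5, threshold=0.5):
    """Binarize frame-wise vocal probabilities, then median-filter the labels.

    probs: 1-D sequence of per-frame vocal probabilities (hypothetical
           classifier output; the abstract does not specify the format).
    kernel_size: odd window length in frames (assumed value).
    threshold: decision threshold for vocal vs. non-vocal (assumed value).
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")

    # 1 = vocal frame, 0 = non-vocal frame
    labels = (np.asarray(probs, dtype=float) >= threshold).astype(int)

    # Repeat the edge labels so every frame has a full window around it
    half = kernel_size // 2
    padded = np.pad(labels, half, mode="edge")

    # Replace each label with the median of its surrounding window,
    # which removes isolated frames that disagree with their neighbours
    smoothed = [np.median(padded[i:i + kernel_size]) for i in range(len(labels))]
    return np.array(smoothed, dtype=int)


# Hypothetical probabilities for ten consecutive frames
frame_probs = [0.9, 0.8, 0.1, 0.9, 0.7, 0.2, 0.1, 0.8, 0.1, 0.2]
print(median_smooth(frame_probs, kernel_size=3))
```

In this sketch, the isolated non-vocal frame at index 2 and the isolated vocal frame at index 7 are both overwritten by their neighbours' labels. A larger kernel suppresses longer spurious runs, at the cost of blurring the true vocal boundaries.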

Singing Voice Detection Via Similarity-Based Semi-Supervised Learning

Investigation of Singing Voice Separation for Singing Voice Detection in Polyphonic Music

Deep Audio-Visual Singing Voice Transcription based on Self-Supervised Learning Models

Transfer Learning for Improving Singing-voice Detection in Polyphonic Instrumental Music

VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

Comparison for Improvements of Singing Voice Detection System Based on Vocal Separation

MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance

Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies

DALI: a large Dataset of synchronized Audio, LyrIcs and notes, automatically created using teacher-student machine learning paradigm

Audiovisual Singing Voice Separation

Research On Singing Voice Detection Based On A Long-Term Recurrent Convolutional Network With Vocal Separation And Temporal Smoothing

Single-Channel Blind Source Separation for Singing Voice Detection: A Comparative Study

Voice activity detection in the wild: A data-driven approach using teacher-student training

Reducing Manual Labeling in Singing Voice Detection: an Active Learning Approach

A Survey on Recent Deep Learning-driven Singing Voice Synthesis Systems

Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher

Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion

Learning the Beauty in Songs: Neural Singing Voice Beautifier

LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance

Jointly Detecting and Separating Singing Voice: A Multi-Task Approach

Learn2Sing: Target Speaker Singing Voice Synthesis by learning from a Singing Teacher