Abstract: Emulating the human ability to solve the cocktail party problem, i.e., focus on a source of interest in a complex acoustic scene, is a long-standing goal of audio source separation research. In this paper, we focus on the cocktail fork problem, which takes a three-pronged approach to source separation by separating an audio mixture such as a movie soundtrack or podcast into the three broad categories of speech, music, and sound effects (SFX, understood to include ambient noise and natural sound events). We evaluate several deep learning-based source separation models on this task using simple objective measures such as signal-to-distortion ratio (SDR) as well as objective metrics that better correlate with human perception. Furthermore, we thoroughly evaluate how source separation can influence the downstream transcription tasks of speech recognition for speech and audio tagging for music and SFX. We also investigate the task of activity detection on the three sources as a way to further improve source separation and transcription. While we observe that source separation improves transcription performance in comparison to the original soundtrack, performance is still sub-optimal due to artifacts introduced by the separation process. Therefore, we thoroughly investigate how remixing of the three separated source stems at various relative levels can reduce artifacts and consequently improve transcription performance. We find that remixing music and SFX interferences at a target SNR of 17.5 dB reduces speech recognition word error rate, and a similar impact from remixing is observed for tagging music and SFX content.
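The remixing step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the remix gain is computed, so the following assumes that SNR is the ratio of the target stem's average power to the average power of the summed interfering stems, measured over the whole signal. The function names `remix_at_snr` and `snr_db` are hypothetical and are not taken from the paper.

```python
import numpy as np


def snr_db(target, interference, eps=1e-12):
    """Power ratio of target to interference in dB (whole-signal average)."""
    return 10 * np.log10(np.mean(target ** 2) / (np.mean(interference ** 2) + eps))


def remix_at_snr(speech, music, sfx, target_snr_db=17.5, eps=1e-12):
    """Add separated music and SFX stems back to the separated speech stem.

    Assumption: the interference is the plain sum of the two non-target stems,
    and a single scalar gain is applied to it so that the speech-to-interference
    power ratio equals target_snr_db.
    """
    interference = music + sfx
    speech_power = np.mean(speech ** 2)
    interference_power = np.mean(interference ** 2)
    # Solve 10*log10(speech_power / (gain^2 * interference_power)) = target_snr_db
    gain = np.sqrt(speech_power / (interference_power * 10 ** (target_snr_db / 10) + eps))
    return speech + gain * interference, gain


# Example with synthetic stems standing in for separated audio.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
music = 0.5 * rng.standard_normal(16000)
sfx = 0.3 * rng.standard_normal(16000)

remixed, gain = remix_at_snr(speech, music, sfx, target_snr_db=17.5)
achieved = snr_db(speech, gain * (music + sfx))
```

In this sketch the same function would apply when music or SFX is the target stem, by swapping which stem is passed first. Whether the paper uses the same target level for those cases is not stated in the abstract.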

Mixing or Extracting? Further Exploring Necessity of Music Separation for Singer Identification

Singer separation for karaoke content generation

Towards Solving The Bottleneck Of Pitch-Based Singing Voice Separation

Audiovisual Singing Voice Separation

Multi-Stage Non-Negative Matrix Factorization for Monaural Singing Voice Separation

Investigation of Singing Voice Separation for Singing Voice Detection in Polyphonic Music

A Deep Learning Based Analysis-Synthesis Framework For Unison Singing

Comparison for Improvements of Singing Voice Detection System Based on Vocal Separation

A Deep-Learning Based Framework for Source Separation, Analysis, and Synthesis of Choral Ensembles

A Novel Framework for Efficient Automated Singer Identification in Large Music Databases

Unsupervised Single-Channel Singing Voice Separation with Weighted Robust Principal Component Analysis Based on Gammatone Auditory Filterbank and Vocal Activity Detection

Combining HMM-based melody extraction and NMF-based soft masking for separating voice and accompaniment from monaural audio

Improving Choral Music Separation through Expressive Synthesized Data from Sampled Instruments

Towards Efficient Automated Singer Identification in Large Music Databases

A Novel Singer Identification Method Using GMM-UBM

Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks

Jointly Detecting and Separating Singing Voice: A Multi-Task Approach

Voice and accompaniment separation in music using self-attention convolutional neural network

Deep Learning Based Source Separation Applied To Choir Ensembles

Zero-Shot Duet Singing Voices Separation with Diffusion Models

Improving Real-Time Music Accompaniment Separation with MMDenseNet