Abstract: This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increase in computational cost. It consists of three components: (i) multi-domain loss (MDL), (ii) a bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL takes advantage of both the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths so that the output instruments can share information. MDL is then applied to combinations of the output sources as well as to each independent source; hence, we call it CL. MDL and CL can easily be applied to many DNN-based separation methods, as they are merely loss functions that are used only during training and do not affect the inference step. The bridging operation does not increase the number of learnable parameters in the network. Experimental results confirmed the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net), and the convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net, and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of the X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX.
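As a rough illustration of how MDL and CL fit together, the following is a minimal sketch and not the authors' implementation. It assumes a naive DFT magnitude in place of a real STFT, plain mean squared error in both domains, and every proper non-empty subset of the sources as the set of combinations. The names `multi_domain_loss`, `combination_loss`, `source_groups`, and `time_weight` are hypothetical and chosen only for this example.

```python
import cmath
from itertools import combinations


def dft_magnitude(signal):
    """Naive DFT magnitude (assumption: stands in for an STFT spectrogram)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_domain_loss(estimate, target, time_weight=1.0):
    """Frequency-domain error plus a weighted time-domain error."""
    freq_term = mse(dft_magnitude(estimate), dft_magnitude(target))
    time_term = mse(estimate, target)
    return freq_term + time_weight * time_term


def source_groups(names):
    """Single sources plus all larger groups, excluding the full mixture
    (assumption: this is the set of combinations used)."""
    names = sorted(names)
    return [group
            for size in range(1, len(names))
            for group in combinations(names, size)]


def mix(signals):
    """Sum several equal-length signals sample by sample."""
    return [sum(samples) for samples in zip(*signals)]


def combination_loss(estimates, targets, time_weight=1.0):
    """Average the multi-domain loss over every group of sources."""
    losses = []
    for group in source_groups(targets):
        est_mix = mix([estimates[name] for name in group])
        tgt_mix = mix([targets[name] for name in group])
        losses.append(multi_domain_loss(est_mix, tgt_mix, time_weight))
    return sum(losses) / len(losses)
```

Because both functions only score the network outputs against the targets, they would be called during training alone; nothing in this sketch changes what the network computes at inference time.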
