WaveTransfer: A Flexible End-to-end Multi-instrument Timbre Transfer with Diffusion

Teysir Baoueb, Xiaoyu Bie, Hicham Janati, Gael Richard
2024-09-06
Abstract: As diffusion-based deep generative models gain prevalence, researchers are actively investigating their potential applications across various domains, including music synthesis and style alteration. Within this work, we are interested in timbre transfer, a process that involves seamlessly altering the instrumental characteristics of musical pieces while preserving essential musical elements. This paper introduces WaveTransfer, an end-to-end diffusion model designed for timbre transfer. We specifically employ the bilateral denoising diffusion model (BDDM) for noise scheduling search. Our model is capable of conducting timbre transfer between audio mixtures as well as individual instruments. Notably, it exhibits versatility in that it accommodates multiple types of timbre transfer between unique instrument pairs in a single model, eliminating the need for separate model training for each pairing. Furthermore, unlike recent works limited to 16 kHz, WaveTransfer can be trained at various sampling rates, including the industry-standard 44.1 kHz, a feature of particular interest to the music community.
Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses timbre transfer in music signal processing: changing the instrumental characteristics of a musical piece while preserving its essential musical elements, such as pitch and rhythmic structure. It introduces WaveTransfer, an end-to-end diffusion model designed for timbre transfer that handles conversion between audio mixtures as well as between individual instruments. A single model accommodates multiple types of timbre transfer between distinct instrument pairs, so no separate model has to be trained for each pairing. Unlike recent work limited to a 16 kHz sampling rate, WaveTransfer can be trained at various sampling rates, including the industry-standard 44.1 kHz. For the noise scheduling search, the authors employ the bilateral denoising diffusion model (BDDM), an existing method that they adopt rather than introduce.
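To make the term "noise schedule" concrete, the following is a minimal sketch of the forward (noising) step used in standard denoising diffusion models. It is not WaveTransfer's implementation and it does not reproduce BDDM's schedule search; the linear schedule, the values `1e-4` and `0.02`, the 1000 steps, the 440 Hz test tone, and all function names are assumptions chosen for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return linearly spaced per-step noise variances (betas).

    The endpoints are illustrative defaults, not values from the paper.
    """
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Compute alpha_bar_t, the running product of (1 - beta_s) for s <= t."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(waveform, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in waveform]


# Toy input: a 440 Hz sine sampled at 44.1 kHz (hypothetical test signal).
sample_rate = 44100
waveform = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1024)]

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
slightly_noisy = add_noise(waveform, alpha_bars[0], rng)   # almost clean
mostly_noise = add_noise(waveform, alpha_bars[-1], rng)    # almost pure noise
```

A schedule search such as the one BDDM performs aims to replace a long schedule like this with a much shorter one for sampling; the sketch only shows what such a schedule controls, namely how much of the original waveform survives at each step.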