Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer

Siyuan Hou, Shansong Liu, Ruibin Yuan, Wei Xue, Ying Shan, Mangsuo Zhao, Chao Zhang
2024-10-08
Abstract: Despite the significant progress in controllable music generation and editing, challenges remain in the quality and length of generated music due to the use of Mel-spectrogram representations and UNet-based model structures. To address these limitations, we propose a novel approach using a Diffusion Transformer (DiT) augmented with an additional control branch using ControlNet. This allows for long-form and variable-length music generation and editing controlled by text and melody prompts. For more precise and fine-grained melody control, we introduce a novel top-$k$ constant-Q Transform representation as the melody prompt, reducing ambiguity compared to previous representations (e.g., chroma), particularly for music with multiple tracks or a wide range of pitch values. To effectively balance the control signals from text and melody prompts, we adopt a curriculum learning strategy that progressively masks the melody prompt, resulting in a more stable training process. Experiments have been performed on text-to-music generation and music-style transfer tasks using open-source instrumental recording data. The results demonstrate that by extending StableAudio, a pre-trained text-controlled DiT model, our approach enables superior melody-controlled editing while retaining good text-to-music generation performance. These results outperform a strong MusicGen baseline in terms of both text-based generation and melody preservation for editing. Audio examples can be found at https://stable-audio-control.github.io/web/.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
The paper attempts to address two main issues:

1. **Quality and length of generated music**: Existing controllable music generation and editing methods struggle to produce high-quality, long-duration music. They typically rely on Mel-spectrogram representations and UNet-based model structures, which constrain both the length and the quality of the generated music. Specifically, the fixed-length output of Mel-spectrogram models and the errors introduced during the transformation process make it difficult to generate precise, variable-length audio.
2. **Precision and flexibility of melody control**: Existing melody prompts often omit part of the melody information, resulting in poor melody retention. For example, some methods use a one-hot 12-pitch-class chromagram as the melody prompt, which struggles to capture pitch variations across multiple octaves and cannot effectively represent the melody of multi-track music.

To address these issues, the authors propose a method that combines a Diffusion Transformer (DiT) with ControlNet and introduces a new melody representation, the top-k constant-Q Transform (top-k CQT). The method aims to achieve long-duration, variable-length music generation and editing controlled by text and melody prompts, while improving the precision and flexibility of melody control.
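The contrast between a one-hot 12-pitch-class chromagram and a top-k selection over CQT bins can be illustrated with a toy example. The sketch below is not the authors' implementation: the function names `top_k_bins` and `chroma_argmax`, the 48-bin layout (4 octaves at 12 bins per octave), and `k = 2` are all assumptions chosen for illustration, and the "CQT" is a hand-built magnitude matrix rather than one computed from audio.

```python
import numpy as np


def top_k_bins(cqt_mag: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k strongest bins per frame, strongest first.

    cqt_mag: non-negative magnitudes, shape (n_bins, n_frames).
    Returns an integer array of shape (k, n_frames).
    """
    order = np.argsort(cqt_mag, axis=0)  # ascending along the bin axis
    return order[-k:][::-1]              # keep the last k rows, strongest first


def chroma_argmax(cqt_mag: np.ndarray, bins_per_octave: int = 12) -> np.ndarray:
    """Fold all octaves together and return the dominant pitch class per frame."""
    n_bins, n_frames = cqt_mag.shape
    n_octaves = n_bins // bins_per_octave
    folded = (
        cqt_mag[: n_octaves * bins_per_octave]
        .reshape(n_octaves, bins_per_octave, n_frames)
        .sum(axis=0)
    )
    return folded.argmax(axis=0)


# Hypothetical magnitude matrix: 48 bins, 2 frames.
cqt = np.zeros((48, 2))
cqt[0, 0], cqt[7, 0] = 1.0, 0.5    # frame 0: a note and the bin 7 semitones above it
cqt[24, 1], cqt[31, 1] = 1.0, 0.5  # frame 1: the same two pitch classes, two octaves higher

chroma = chroma_argmax(cqt)  # one pitch class per frame, octave discarded
top = top_k_bins(cqt, k=2)   # k absolute bin indices per frame, octave kept
```

In this toy case the chroma-style prompt assigns both frames the same pitch class, so the two-octave jump is invisible, while the top-k indices differ between frames and also retain the second, weaker pitch. This is the kind of ambiguity reduction the abstract attributes to the top-k CQT prompt for music with a wide pitch range or multiple tracks.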