Abstract: We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio. We encourage readers to listen to demo audio examples at https://team.doubao.com/seed-music.

What problem does this paper attempt to address?

The paper addresses four key challenges in music generation:

1. **Domain Complexity**: Music signals are highly complex, requiring both short-term melodic coherence and long-term structural consistency. Vocal music is particularly hard to model because it involves overlapping sounds across a wide frequency range, extensive vocal ranges, and rich expressive techniques.
2. **Evaluation Difficulty**: Assessing the artistic quality of generated music typically requires domain expertise to judge the appeal of the melody, the consistency of chord progressions, the authenticity of the structure, and the expressiveness of the vocals. These artistic elements are deeply influenced by cultural and regional differences, which makes them very difficult to quantify.
3. **Data Complexity**: Generative models require annotated music data to learn how to generate outputs conditioned on lyrics, style, instruments, and song structure. Unlike simple speech transcription or image annotation, music annotation requires professional musical knowledge.
4. **Diverse User Needs**: Different users have vastly different music creation needs. Beginners might want to generate complete audio segments from text prompts, while professional producers might need more granular control over their work, such as editing individual instrument tracks.

To address these challenges, the paper proposes the Seed-Music framework, with the following contributions:

- **Unified Framework**: Combining autoregressive language modeling and diffusion approaches to support high-quality vocal music generation, controllable through multimodal inputs such as lyrics, style descriptions, audio references, scores, and voice prompts.
- **Fine-Grained Editing**: Proposing a diffusion-based method that supports direct editing of lyrics, melody, and timbre in existing music audio tracks.
- **Zero-Shot Vocal Conversion**: Introducing a new zero-shot method that performs vocal conversion from just a 10-second singing or speech recording provided by the user.

Through these methods, Seed-Music aims to lower the barriers to music creation, enabling both beginners and professionals to benefit at different stages of music production.

Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Simple and Controllable Music Generation

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework

Efficient Neural Music Generation

A Comprehensive Survey on Deep Music Generation: Multi-level Representations, Algorithms, Evaluations, and Future Directions

StemGen: A music generation model that listens

Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls

SongCreator: Lyrics-based Universal Song Generation

Content-based Controls For Music Large Language Modeling

Music ControlNet: Multiple Time-varying Controls for Music Generation

A review of intelligent music generation systems

Personalized Popular Music Generation Using Imitation and Structure

Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer

Deep Learning-Based Music Generation

InstructME: An Instruction Guided Music Edit And Remix Framework with Latent Diffusion Models

M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models

Musical Elements Enhancement and Image Content Preservation Network for Image to Music Generation

PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation

Whole-Song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models

The Usage of Artificial Intelligence Technology in Music Education System Under Deep Learning