Abstract: Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year in advancing the performance of large-scale audio generative models for a single modality (speech, sound, or music) by adopting more powerful generative models and scaling data. However, these models lack controllability in several aspects: speech generation models cannot synthesize novel styles based on text descriptions and are limited in domain coverage, such as outdoor environments; sound generation models provide only coarse-grained control based on descriptions like "a person speaking" and would generate only mumbling human voices. This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms. We allow transcript, vocal, and other audio styles to be controlled independently when generating speech. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks on speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speed up generation by over 25 times compared to the default ODE solver for flow-matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/

Jukebox: A Generative Model for Music

Generating Rhythm Game Music with Jukebox

JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation

Conditioning Deep Generative Raw Audio Models for Structured Automatic Music

Simple and Controllable Music Generation

DeepJ: Style-Specific Music Generation

StemGen: A music generation model that listens

A Novel Audio Representation for Music Genre Identification in MIR

Audiobox: Unified Audio Generation with Natural Language Prompts

A Generative Model for Raw Audio Using Transformer Architectures

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models

Audio Conditioning for Music Generation via Discrete Bottleneck Features

MGU-V: A Deep Learning Approach for Lo-Fi Music Generation Using Variational Autoencoders With State-of-the-Art Performance on Combined MIDI Datasets

VampNet: Music Generation via Masked Acoustic Token Modeling

Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models

The Jazz Transformer on the Front Line: Exploring the Shortcomings of AI-composed Music through Quantitative Measures

Song From PI: A Musically Plausible Network for Pop Music Generation

Evaluating Deep Music Generation Methods Using Data Augmentation

Unconditional Audio Generation with Generative Adversarial Networks and Cycle Regularization

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale