Abstract: Audio is an essential part of our lives, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year in advancing the performance of large-scale audio generative models for a single modality (speech, sound, or music) by adopting more powerful generative models and scaling data. However, these models lack controllability in several respects: speech generation models cannot synthesize novel styles from a text description and are limited in domain coverage, such as outdoor environments; sound generation models provide only coarse-grained control based on descriptions like "a person speaking" and generate only mumbling human voices. This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and unify the speech and sound generation paradigms. We allow transcript, vocal, and other audio styles to be controlled independently when generating speech. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks on speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speed up generation by over 25 times compared to the default ODE solver for flow-matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/

What Do I Hear? Generating Sounds for Visuals with ChatGPT

Visual to Sound: Generating Natural Sound for Videos in the Wild

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Leveraging AI to Generate Audio for User-generated Content in Video Games

Audiobox: Unified Audio Generation with Natural Language Prompts

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

An Initial Exploration: Learning to Generate Realistic Audio for Silent Video

Diverse and Vivid Sound Generation from Text Descriptions

Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience

Unraveling ChatGPT: A Critical Analysis of AI-Generated Goal-Oriented Dialogues and Annotations

ChatGPT: Revolutionizing User Interactions with Advanced Natural Language Processing

Generating Realistic Images from In-the-wild Sounds

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

Creative Text-to-Audio Generation via Synthesizer Programming

DGSNA: prompt-based Dynamic Generative Scene-based Noise Addition method

SoundScape: A Human-AI Co-Creation System Making Your Memories Heard

I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception

Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations