Abstract: Systems for synthesizer sound matching, which automatically set the parameters of a synthesizer to emulate an input sound, have the potential to make the process of synthesizer programming faster and easier for novice and experienced musicians alike, whilst also affording new means of interaction with synthesizers. Considering the enormous variety of synthesizers in the marketplace, and the complexity of many of them, general-purpose sound matching systems that function with minimal knowledge or prior assumptions about the underlying synthesis architecture are particularly desirable. With this in mind, we introduce a synthesizer sound matching model based on the Audio Spectrogram Transformer. We demonstrate the viability of this model by training on a large synthetic dataset of randomly generated samples from the popular Massive synthesizer. We show that this model can reconstruct parameters of samples generated from a set of 16 parameters, highlighting its improved fidelity relative to multi-layer perceptron and convolutional neural network baselines. We also provide audio examples demonstrating the out-of-domain model performance in emulating vocal imitations, and sounds from other synthesizers and musical instruments.

What problem does this paper attempt to address?

This paper addresses synthesizer sound matching: automatically setting a synthesizer's parameters so that it imitates an input sound. The goal is to make synthesizer programming easier for both beginners and experienced musicians while enabling new forms of interaction, such as controlling a synthesizer by voice or reproducing sounds from sampled tracks on one's own synthesizer.

The paper proposes a sound matching method based on the Audio Spectrogram Transformer (AST). By training the model on a large number of randomly generated sound samples, the researchers show that it can reconstruct the parameters behind an input sound, so that the synthesizer produces a similar sound. They also provide audio examples of the model's out-of-domain performance on inputs such as vocal imitations and sounds from other synthesizers and musical instruments.

Specifically, the main contributions of the paper include:

1. **Proposing a general synthesizer sound-matching architecture**: The architecture is based on AST and is intended to work across different synthesizers without a special design for each one.
2. **Using a large-scale synthetic dataset for training**: By generating a large number of random parameter settings and the corresponding audio samples, the researchers created a dataset of 1 million samples for training the model.
3. **Evaluating the performance of the model**: Comparison against multi-layer perceptron (MLP) and convolutional neural network (CNN) baselines shows that the AST model is more accurate in both parameter prediction and audio reconstruction.

These contributions indicate that the AST-based approach to synthesizer sound matching is practical and flexible, and that it may inform the design of future music production tools.
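The data-generation idea in the summary above (sample random parameter settings, render the corresponding audio, and keep the parameters as regression targets) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `toy_synth` is a hypothetical stand-in that uses only three of the 16 parameters, not the Massive synthesizer, and `magnitude_spectrogram` is a plain STFT rather than the paper's actual front end.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate for this sketch
NUM_PARAMS = 16      # the abstract mentions a set of 16 parameters


def toy_synth(params, duration=1.0, sr=SAMPLE_RATE):
    """Hypothetical stand-in synthesizer; only params[0..2] affect the sound."""
    t = np.arange(int(duration * sr)) / sr
    freq = 110.0 * 2.0 ** (3.0 * params[0])  # carrier pitch, 110 Hz to 880 Hz
    mod_depth = 5.0 * params[1]              # frequency-modulation depth
    decay = 1.0 + 9.0 * params[2]            # amplitude decay rate
    phase = 2 * np.pi * freq * t + mod_depth * np.sin(2 * np.pi * 2 * freq * t)
    return np.exp(-decay * t) * np.sin(phase)


def magnitude_spectrogram(audio, n_fft=512, hop=256):
    """Hann-windowed STFT magnitude, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack(
        [audio[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def make_dataset(n_samples, seed=0):
    """Return (spectrograms, parameters): model inputs and regression targets."""
    rng = np.random.default_rng(seed)
    # Each parameter is drawn uniformly from [0, 1), one row per sample.
    params = rng.uniform(0.0, 1.0, size=(n_samples, NUM_PARAMS))
    specs = np.stack([magnitude_spectrogram(toy_synth(p)) for p in params])
    return specs, params


specs, params = make_dataset(4)
```

A sound-matching model of any kind (MLP, CNN, or AST) would then be trained to map each spectrogram in `specs` back to its row in `params`; at inference time, the predicted parameters are sent to the synthesizer to render the match.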

Synthesizer Sound Matching Using Audio Spectrogram Transformers

Sound2Synth: Interpreting Sound Via FM Synthesizer Parameters Estimation

Universal Adaptor: Converting Mel-Spectrograms Between Different Configurations for Speech Synthesis

Contrastive Learning from Synthetic Audio Doppelgangers

NAS-FM: Neural Architecture Search for Tunable and Interpretable Sound Synthesis based on Frequency Modulation

SynthScribe: Deep Multimodal Tools for Synthesizer Sound Retrieval and Exploration

One-Shot Acoustic Matching Of Audio Signals -- Learning to Hear Music In Any Room/ Concert Hall

Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?

A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis

Learning to Solve Inverse Problems for Perceptual Sound Matching

Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer

Volume-Independent Music Matching by Frequency Spectrum Comparison

Transformer-Based Speech Synthesizer Attribution in an Open Set Scenario

Perceptual-Neural-Physical Sound Matching

Sounderfeit: Cloning a Physical Model with Conditional Adversarial Autoencoders

DiffMoog: a Differentiable Modular Synthesizer for Sound Matching

Optimization Techniques for a Physical Model of Human Vocalisation

Synthia's Melody: A Benchmark Framework for Unsupervised Domain Adaptation in Audio

Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?

Exploring Sampling Techniques for Generating Melodies with a Transformer Language Model

Comparative Study of State-based Neural Networks for Virtual Analog Audio Effects Modeling