Synthesizer Sound Matching Using Audio Spectrogram Transformers

Fred Bruford, Frederik Blang, Shahan Nercessian
2024-07-24
Abstract: Systems for synthesizer sound matching, which automatically set the parameters of a synthesizer to emulate an input sound, have the potential to make the process of synthesizer programming faster and easier for novice and experienced musicians alike, whilst also affording new means of interaction with synthesizers. Considering the enormous variety of synthesizers in the marketplace, and the complexity of many of them, general-purpose sound matching systems that function with minimal knowledge or prior assumptions about the underlying synthesis architecture are particularly desirable. With this in mind, we introduce a synthesizer sound matching model based on the Audio Spectrogram Transformer. We demonstrate the viability of this model by training on a large synthetic dataset of randomly generated samples from the popular Massive synthesizer. We show that this model can reconstruct parameters of samples generated from a set of 16 parameters, highlighting its improved fidelity relative to multi-layer perceptron and convolutional neural network baselines. We also provide audio examples demonstrating the out-of-domain model performance in emulating vocal imitations, and sounds from other synthesizers and musical instruments.
Audio and Speech Processing, Machine Learning, Sound
What problem does this paper attempt to address?
This paper addresses synthesizer sound matching: automatically setting a synthesizer's parameters so that it imitates an input sound. The aim is to make synthesizer programming faster and easier for both novice and experienced musicians, and to open up new modes of interaction, such as controlling a synthesizer by voice or recreating sounds from sampled tracks on one's own synthesizer.

The paper proposes a sound matching model based on the Audio Spectrogram Transformer (AST). The model is trained on a large synthetic dataset of randomly generated samples from the Massive synthesizer, and the authors show that it can reconstruct the parameters of samples generated from a set of 16 parameters. They also provide audio examples of its out-of-domain behavior on vocal imitations and on sounds from other synthesizers and musical instruments.

The main contributions are:

1. **Proposing a general synthesizer sound matching architecture**: The architecture is based on AST and is intended to work with minimal knowledge of, or assumptions about, the underlying synthesis architecture, rather than being designed for one specific synthesizer.
2. **Using a large-scale synthetic dataset for training**: Random parameter settings are generated and rendered to audio, yielding a dataset of 1 million samples for training the model.
3. **Evaluating the performance of the model**: Comparison with multi-layer perceptron (MLP) and convolutional neural network (CNN) baselines shows that the AST model achieves higher fidelity in parameter prediction and audio reconstruction.

These contributions suggest that AST-based sound matching is a practical and flexible approach that could inform future music production tools.
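The synthetic-data idea described above (sample random parameter settings, render each one to audio, and keep the parameters as regression targets) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_synth` is a hypothetical three-parameter stand-in and not the Massive synthesizer, and the sample rate, clip length, parameter ranges, and dataset size are arbitrary choices rather than values from the paper.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed value, not taken from the paper
DURATION_S = 0.5      # assumed clip length


def toy_synth(params: np.ndarray) -> np.ndarray:
    """Render one clip from three normalized parameters in [0, 1].

    Hypothetical stand-in for a real synthesizer; the parameters are
    (pitch, brightness, decay).
    """
    pitch, brightness, decay = params
    t = np.arange(int(SAMPLE_RATE * DURATION_S)) / SAMPLE_RATE
    freq = 110.0 * 2.0 ** (3.0 * pitch)  # maps [0, 1] to 110 Hz .. 880 Hz
    tone = np.sin(2 * np.pi * freq * t)
    overtone = brightness * np.sin(2 * np.pi * 2 * freq * t)
    envelope = np.exp(-t * (1.0 + 20.0 * decay))  # faster decay for larger values
    return ((tone + overtone) * envelope / 2.0).astype(np.float32)


def make_dataset(n_samples: int, seed: int = 0):
    """Sample random parameter vectors and render the matching audio clips."""
    rng = np.random.default_rng(seed)
    params = rng.uniform(0.0, 1.0, size=(n_samples, 3)).astype(np.float32)
    audio = np.stack([toy_synth(p) for p in params])
    return audio, params  # model inputs and regression targets


audio, params = make_dataset(8)
print(audio.shape, params.shape)
```

A sound matching model would then be trained to map each row of `audio` (typically after conversion to a spectrogram) back to the corresponding row of `params`. Because the labels come for free from the sampling step, the dataset can be made as large as rendering time allows.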