Abstract: Timbre is a set of perceptual attributes that identifies different types of sound sources. Although its definition is usually elusive, it can be seen from a signal processing viewpoint as all the spectral features that are perceived independently from pitch and loudness. Some works have studied high-level timbre synthesis by analyzing the feature relationships of different instruments, but acoustic properties remain entangled and generation bound to individual sounds. This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features. We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution. Timbre transfer can be performed by encoding any variable-length input signals into the quantized latent features that are decoded according to the learned timbre. We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments as an intuitive modality to drive sound synthesis. Furthermore, we can map the discrete latent space to acoustic descriptors and directly perform descriptor-based synthesis.
What problem does this paper attempt to address?
This paper aims to develop a more flexible and interpretable method for timbre synthesis by learning a discrete, quantized representation of a given timbre distribution. Specifically, the authors introduce an auto-encoder with a discrete latent space so that timbre transfer and sound synthesis become more flexible and can be controlled directly through acoustic descriptors. The main objectives and problems addressed are as follows:
1. **Definition and Challenges of Timbre**:
- Timbre is a set of perceptual attributes that can distinguish different types of sound sources. Although its definition is often elusive, from the perspective of signal processing, it can be regarded as all spectral features independent of pitch and loudness.
- Existing research has analyzed the feature relationships between different instruments, but the acoustic properties remain entangled and generation remains bound to individual sounds.
2. **Limitations of Existing Methods**:
- Work on high-level timbre synthesis has not disentangled these acoustic properties, so it offers no decoupled representation of timbre to control.
- Existing methods have not been fully evaluated in the signal domain.
3. **Proposed New Method**:
- The paper proposes a method based on the Vector-Quantized Variational Auto-Encoder (VQ-VAE) that learns a discrete, quantized representation of a given timbre distribution. This makes timbre transfer more flexible and allows synthesis to be controlled directly through acoustic descriptors.
- The model uses a discrete latent space that is disentangled from loudness to learn the quantized representation of the timbre.
- Timbre transfer is performed by encoding an input signal of arbitrary length into the quantized latent features and then decoding them according to the learned timbre.
4. **Specific Application Scenarios**:
- The paper reports timbre transfer experiments between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments as an intuitive way to drive sound synthesis.
- A method of mapping from the discrete latent space to acoustic descriptors is proposed, so that synthesis can be directly based on descriptors.
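
The quantization step behind a discrete latent space can be illustrated with a minimal sketch. It is not the paper's implementation: it only shows the generic nearest-neighbor lookup used in VQ-VAE-style models, where each encoder output frame is replaced by its closest codebook vector. The names `quantize`, `codebook`, and `frames`, the 2-dimensional vectors, and the 3-entry codebook are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Replace every frame by its nearest codebook entry.

    Returns the selected code indices and the quantized vectors.
    The input may have any number of frames, which is what allows
    variable-length signals to be encoded.
    """
    indices: List[int] = []
    quantized: List[Vector] = []
    for frame in frames:
        # Index of the codebook vector closest to this frame.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        indices.append(best)
        quantized.append(codebook[best])
    return indices, quantized


# Hypothetical toy data: a 3-entry codebook and 4 encoder output frames.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (0.2, 1.9), (0.6, 0.6)]

indices, quantized = quantize(frames, codebook)
print(indices)  # [0, 1, 2, 1]
```

In a trained model the codebook would be learned from recordings of the target timbre, so a decoder fed with these quantized vectors can only produce sounds built from that timbre's learned features. Because each code is a fixed vector, it can also in principle be associated with acoustic descriptor values, which is the idea behind descriptor-based synthesis.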
### Summary
The paper addresses how to achieve more flexible and controllable timbre transfer and sound synthesis by learning a discrete, quantized representation of timbre. The authors propose a VQ-VAE-based model that encodes variable-length input signals into quantized latent features, disentangled from loudness, and decodes them according to the learned timbre. The model also supports controlling synthesis directly through acoustic descriptors, which improves the flexibility and interpretability of timbre synthesis.