Abstract: Reverberation is a key element in spatial audio perception, historically achieved with the use of analogue devices, such as plate and spring reverb, and in the last decades with digital signal processing techniques that have allowed different approaches for Virtual Analogue Modelling (VAM). The electromechanical functioning of the spring reverb makes it a nonlinear system that is difficult to fully emulate in the digital domain with white-box modelling techniques. In this study, we compare five different neural network architectures, including convolutional and recurrent models, to assess their effectiveness in replicating the characteristics of this audio effect. The evaluation is conducted on two datasets at sampling rates of 16 kHz and 48 kHz. This paper specifically focuses on neural audio architectures that offer parametric control, aiming to advance the boundaries of current black-box modelling techniques in the domain of spring reverberation.
What problem does this paper attempt to address?
This paper asks how effectively different neural network architectures can replicate the characteristics of spring reverb, a nonlinear system whose complex behaviour is hard to simulate in the digital domain. Specifically, it evaluates five architectures (including convolutional and recurrent models) on how well they capture the distinctive acoustic characteristics of spring reverb, and compares them systematically on two datasets with different sampling rates (16 kHz and 48 kHz).
### Research Background
Reverberation is an important element of spatial audio perception, and it has traditionally been achieved with analogue devices (such as plate reverb and spring reverb). In recent decades, advances in digital signal processing have made virtual analogue modelling (VAM) a practical alternative. However, the electromechanical working principle of spring reverb makes it a nonlinear system, so traditional white-box modelling techniques struggle to simulate its characteristics fully and accurately in the digital domain.
### Research Objectives
This paper focuses on neural audio architectures that offer parametric control, aiming to push the boundaries of current black-box modelling techniques for spring reverb. By comparing the performance of different neural network architectures, the study seeks the model best suited to real-time processing in high-fidelity audio applications.
### Main Contributions
1. **Model Comparison**: Evaluated the capabilities of five different neural network architectures (TCN, WaveNet, GCN, LSTM, and GRU) in replicating the characteristics of spring reverb.
2. **Dataset Usage**: Used two public datasets (SpringSet and EGFxSet) to conduct experiments at sampling rates of 16 kHz and 48 kHz respectively.
3. **Performance Evaluation**: Evaluated model performance with quantitative metrics, namely error-to-signal ratio (ESR), multi-resolution STFT (MRSTFT) error, and real-time factor (RTF), to keep the results reproducible and transparent.
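The RTF metric mentioned in the list above can be illustrated with a minimal sketch. It assumes RTF is defined as processing time divided by audio duration, so values below 1 mean faster than real time; the paper's exact definition is not given here, and some works report the inverse ratio. The function name `real_time_factor` and the `process` callable are hypothetical stand-ins for a model's inference call.

```python
import statistics
import time


def real_time_factor(process, audio, sample_rate, repeats=5):
    """Return median processing time divided by audio duration.

    Assumed convention: RTF < 1 means the processor runs faster than real time.
    `process` is any callable that takes the audio samples and returns output.
    """
    duration = len(audio) / sample_rate  # seconds of audio being processed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        process(audio)
        timings.append(time.perf_counter() - start)
    # The median is less sensitive to one-off scheduling hiccups than the mean.
    return statistics.median(timings) / duration
```

In practice `process` would wrap a forward pass of the trained network on a fixed-length buffer, and the measurement would be repeated on the target hardware.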
### Conclusions
The results show that the WaveNet model performs very well at a sampling rate of 16 kHz and captures the subtle features of spring reverb, while the GCN model performs best at 48 kHz, outperforming the other models on the MRSTFT metric and also showing an advantage in real-time processing. These findings support future work on real-time modelling of high-fidelity audio effects.
### Formula Summary
- **Total Loss Function**:
\[
L = L_{\text{SmoothL1}}+L_{\text{STFT}}
\]
- **ESR Calculation Formula**:
\[
L_{\text{ESR}}=\frac{\sum_{i = 0}^{N - 1}|y_i-\hat{y}_i|^2}{\sum_{i = 0}^{N - 1}|y_i|^2}
\]
- **Multi-Resolution STFT Loss Function**:
\[
L_{\text{MRSTFT}}(\hat{y},y)=\sum_{m = 1}^{M}(l_m^{\text{SC}}(\hat{y},y)+\alpha l_m^{\text{SM}}(\hat{y},y))
\]
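where \(M\) is the total number of STFT resolutions, \(l_m^{\text{SC}}\) and \(l_m^{\text{SM}}\) are the spectral convergence and log-magnitude terms at resolution \(m\), each computed from the STFT magnitudes \(|y|\) and \(|\hat{y}|\) of the true and predicted signals, and \(\alpha\) is the weight factor of the log-magnitude term.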
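The ESR and MRSTFT formulas above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the paper's code: the window (Hann), the FFT sizes and hop lengths, the `eps` stabilisers, and the exact form of the two per-resolution terms (Frobenius-norm spectral convergence and mean absolute log-magnitude difference, as commonly used in the literature) are all assumptions. The sum over resolutions follows the formula as written, without averaging over \(M\).

```python
import numpy as np


def esr(y, y_hat, eps=1e-12):
    """Error-to-signal ratio: residual energy divided by target energy."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sum(np.abs(y - y_hat) ** 2) / (np.sum(np.abs(y) ** 2) + eps)


def stft_magnitude(x, n_fft, hop):
    """Magnitude STFT with a Hann window, shape (frames, n_fft // 2 + 1)."""
    x = np.asarray(x, dtype=float)
    if len(x) < n_fft:
        x = np.pad(x, (0, n_fft - len(x)))  # zero-pad very short signals
    n_frames = 1 + (len(x) - n_fft) // hop
    # Index matrix: one row of sample indices per frame.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)
    return np.abs(np.fft.rfft(frames, axis=1))


def mrstft(y_hat, y, resolutions=((512, 128), (1024, 256), (2048, 512)),
           alpha=1.0, eps=1e-8):
    """Sum over resolutions of spectral convergence + alpha * log-magnitude."""
    total = 0.0
    for n_fft, hop in resolutions:  # assumed (n_fft, hop) pairs
        s_hat = stft_magnitude(y_hat, n_fft, hop)
        s = stft_magnitude(y, n_fft, hop)
        # Spectral convergence: relative Frobenius-norm error of magnitudes.
        sc = np.linalg.norm(s - s_hat) / (np.linalg.norm(s) + eps)
        # Log-magnitude term: mean absolute difference of log magnitudes.
        sm = np.mean(np.abs(np.log(s + eps) - np.log(s_hat + eps)))
        total += sc + alpha * sm
    return total
```

Both functions return 0 for a perfect prediction, and `esr` is approximately 1 when the prediction is silence, which gives a quick sanity check before using them to compare models.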
Through these formulas and experimental results, the paper provides valuable insights into the application of neural networks in audio effect modelling.