Abstract: A multichannel extension to the RVQGAN neural coding method is proposed, and realized for data-driven compression of third-order Ambisonics audio. The input- and output layers of the generator and discriminator models are modified to accept multiple (16) channels without increasing the model bitrate. We also propose a loss function for accounting for spatial perception in immersive reproduction, and transfer learning from single-channel models. Listening test results with 7.1.4 immersive playback show that the proposed extension is suitable for coding scene-based, 16-channel Ambisonics content with good quality at 16 kbit/s.
What problem does this paper attempt to address?
The paper addresses the efficient compression of third-order Ambisonics audio (a scene-based, multichannel format for immersive audio), so that high-quality multichannel audio can be transmitted and stored at a low bit rate. Specifically, the authors extend RVQGAN (Residual Vector-Quantized Generative Adversarial Network) to 16-channel third-order Ambisonics audio and introduce a new loss function that accounts for spatial perception.
### Main Problems and Solutions
1. **Efficient Compression of Multichannel Audio**
- **Problem**: Traditional audio compression methods become less efficient as the channel count grows. Higher-order Ambisonics audio is a demanding case: third order already requires 16 channels, which means a large amount of data and a high bit rate.
- **Solution**: The authors propose a multichannel extension of the RVQGAN neural coding method. Modifying the input and output layers of the generator and discriminator models lets them handle multiple (16) channels without increasing the bit rate of the model.
2. **Considering Spatial Perception Characteristics**
- **Problem**: In immersive audio, spatial perception is very important, but existing compression methods usually ignore this.
- **Solution**: The authors introduce a new loss function specifically for evaluating and optimizing the quality of spatial perception, ensuring that the compressed audio can still maintain a good spatial effect during immersive playback.
3. **Transfer Learning from Monophonic Models**
- **Problem**: Training multichannel models requires a large amount of data and computing resources, and training directly from scratch can be time-consuming and resource-intensive.
- **Solution**: The authors propose transfer learning from pre-trained monophonic models, which reduces training time and resource requirements while improving the initial performance of the model.
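Items 1 and 3 can be illustrated together with a minimal NumPy sketch. It assumes convolution weights are stored as `(out_channels, in_channels, kernel_size)` and that a mono weight is simply tiled across the 16 channels; the layer sizes, the helper names `widen_input_conv` and `widen_output_conv`, and the `1 / n_channels` scaling are hypothetical choices made for illustration, not details confirmed by the paper.

```python
import numpy as np


def widen_input_conv(w_mono: np.ndarray, n_channels: int) -> np.ndarray:
    """Tile a mono input-conv weight (out, 1, k) to (out, n_channels, k).

    Assumption: dividing by n_channels, so that n identical input channels
    produce the same activation as the single mono channel did.
    """
    assert w_mono.ndim == 3 and w_mono.shape[1] == 1
    return np.repeat(w_mono, n_channels, axis=1) / n_channels


def widen_output_conv(w_mono: np.ndarray, n_channels: int) -> np.ndarray:
    """Tile a mono output-conv weight (1, in, k) to (n_channels, in, k)."""
    assert w_mono.ndim == 3 and w_mono.shape[0] == 1
    return np.repeat(w_mono, n_channels, axis=0)


rng = np.random.default_rng(0)
# Hypothetical layer sizes: 64 hidden features, kernel size 7.
w_in_mono = rng.standard_normal((64, 1, 7))
w_out_mono = rng.standard_normal((1, 64, 7))

w_in_multi = widen_input_conv(w_in_mono, 16)
w_out_multi = widen_output_conv(w_out_mono, 16)

# Only the first and last layers change shape; everything between them,
# including the quantized bottleneck, keeps the mono model's dimensions.
print(w_in_multi.shape, w_out_multi.shape)  # (64, 16, 7) (16, 64, 7)
```

Because only these two layers depend on the channel count, the number of codes emitted per second, and therefore the bit rate, is the same as in the mono model.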
### Specific Methods
- **Model Architecture**: The input and output layers of RVQGAN are modified to handle 16-channel audio signals, while the dimension of the model's bottleneck is left unchanged, so compression efficiency is maintained.
- **Loss Function**: The training objective combines adversarial loss, feature-matching loss, VQ codebook loss, and reconstruction loss with a newly introduced covariance loss. The covariance loss preserves the correlation between channels, which is very important for the perceived spatial impression.
- **Transfer Learning**: The convolutional weights of the pre-trained monophonic model are copied to the input and output layers of the multichannel model, which accelerates training and improves the final performance.
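The idea behind the covariance loss can be sketched as follows. This summary does not give the paper's exact formulation, so the version below is an assumption: it compares broadband, time-domain inter-channel covariance matrices with a mean absolute difference, and the function names are hypothetical.

```python
import numpy as np


def channel_covariance(x: np.ndarray) -> np.ndarray:
    """Inter-channel covariance of a signal shaped (channels, samples)."""
    x = x - x.mean(axis=1, keepdims=True)
    return x @ x.T / x.shape[1]


def covariance_loss(ref: np.ndarray, est: np.ndarray) -> float:
    """Mean absolute difference between the two covariance matrices."""
    return float(np.mean(np.abs(channel_covariance(ref) - channel_covariance(est))))


rng = np.random.default_rng(0)
source = rng.standard_normal(4800)
gains = rng.standard_normal(16)
# One source spread over 16 channels: the channels are strongly correlated.
ref = gains[:, None] * source[None, :]

# Shifting each channel by a different lag keeps every channel's energy
# but destroys the correlation between channels.
est = np.stack([np.roll(ref[c], 37 * c) for c in range(16)])

print(covariance_loss(ref, ref))      # 0.0
print(covariance_loss(ref, est) > 0)  # True
```

A per-channel waveform or spectral loss computed on magnitudes would treat `est` as nearly perfect, whereas the covariance term penalizes the lost inter-channel relationships that carry the spatial image.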
### Experimental Results
In listening tests with a 7.1.4 immersive loudspeaker layout, the method achieves a "good" MUSHRA score at a bit rate of 16 kbit/s, outperforming traditional methods at 160 kbit/s. This indicates that the method is effective at low bit rates.
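To put these bit rates in perspective, the short calculation below works through the arithmetic. The 16 kbit/s and 160 kbit/s figures and the channel count come from the text above; the uncompressed PCM reference (44.1 kHz, 16 bits per sample) is an assumption made only for illustration, since the summary does not state the sample rate or bit depth.

```python
def ambisonic_channels(order: int) -> int:
    """Number of channels of a full-sphere Ambisonics signal of a given order."""
    return (order + 1) ** 2


TOTAL_KBPS = 16      # neural codec, all channels together (from the text)
BASELINE_KBPS = 160  # traditional methods in the comparison (from the text)
N_CHANNELS = ambisonic_channels(3)

# Assumed uncompressed reference format; not stated in the summary.
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16

per_channel_kbps = TOTAL_KBPS / N_CHANNELS
baseline_ratio = BASELINE_KBPS / TOTAL_KBPS
pcm_kbps = N_CHANNELS * SAMPLE_RATE_HZ * BITS_PER_SAMPLE / 1000
compression_ratio = pcm_kbps / TOTAL_KBPS

print(f"channels: {N_CHANNELS}")
print(f"budget per channel: {per_channel_kbps} kbit/s")
print(f"bit rate saving vs. baseline: {baseline_ratio}x")
print(f"ratio vs. assumed PCM: {compression_ratio:.1f}x")
```

Under these assumptions the codec spends on average 1 kbit/s per Ambisonics channel and uses one tenth of the baseline's bit rate.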
### Summary
This paper addresses the efficient compression of multichannel Ambisonics audio at a low bit rate, with particular attention to preserving spatial perception, and offers a new approach to the transmission and storage of immersive audio.