MFCC-GAN Codec: A New AI-based Audio Coding

Mohammad Reza Hasanabadi
2023-10-22
Abstract: In this paper, we propose an AI-based audio codec that uses MFCC features in an adversarial setting. We combine a conventional encoder with an adversarially trained decoder to better reconstruct the original waveform. Because GANs estimate the data density implicitly, such models are less prone to overfitting. We compare our work with five well-known codecs, namely AAC, AC3, Opus, Vorbis, and Speex, operating at bitrates from 2 kbps to 128 kbps. MFCCGAN_36k achieves the state-of-the-art result in terms of SNR despite a lower bitrate than AC3_128k, AAC_112k, Vorbis_48k, Opus_48k, and Speex_48k. MFCCGAN_13k also achieves a high SNR of 27, equal to that of AC3_128k and AAC_112k, at a significantly lower bitrate (13 kbps). MFCCGAN_36k achieves higher NISQA-MOS results than AAC_48k while having a 20% lower bitrate. Furthermore, MFCCGAN_13k obtains a NISQA-MOS of 3.9, which is much higher than that of AAC_24k, AAC_32k, AC3_32k, and AAC_48k. For future work, we suggest adopting loss functions that optimize intelligibility and perceptual metrics in the MFCCGAN structure to improve quality and intelligibility simultaneously.
Audio and Speech Processing, Sound
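The abstract ranks codecs by SNR but this page does not state how the SNR was computed. As a rough illustration only, the sketch below uses the common waveform-domain definition (signal energy over error energy, in dB). The function name `snr_db` and the assumption that the reference and decoded signals are time-aligned and of equal length are choices made for this example, not details taken from the paper.

```python
import numpy as np


def snr_db(reference, decoded):
    """Waveform SNR in dB: 10 * log10(sum(x^2) / sum((x - y)^2)).

    Assumes both signals are time-aligned and have the same length;
    a real codec evaluation may need delay compensation first.
    """
    reference = np.asarray(reference, dtype=np.float64)
    decoded = np.asarray(decoded, dtype=np.float64)
    if reference.shape != decoded.shape:
        raise ValueError("reference and decoded must have the same shape")

    noise_power = np.sum((reference - decoded) ** 2)
    if noise_power == 0.0:
        return float("inf")  # perfect reconstruction
    signal_power = np.sum(reference ** 2)
    return float(10.0 * np.log10(signal_power / noise_power))


if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    clean = np.sin(2.0 * np.pi * 440.0 * t)
    # Hypothetical "decoded" signal: the clean tone attenuated by 10%.
    print(snr_db(clean, 0.9 * clean))
```

Under this definition, SNR rewards sample-accurate reconstruction, which is why it is usually reported alongside a perceptual score such as NISQA-MOS rather than on its own.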
What problem does this paper attempt to address?
This paper proposes a novel audio coding method, the MFCC-GAN Codec, based on Mel-Frequency Cepstral Coefficient (MFCC) features and Generative Adversarial Networks (GANs). The authors aim to reduce the reconstruction distortion of traditional audio coding, especially at low bitrates. By combining a conventional encoder with a GAN-based learned decoder, the method reconstructs the original audio waveform from the extracted MFCC features.

The main points of the MFCC-GAN Codec are:

1. **Feature Extraction**: MFCCs, which are commonly used in speech recognition and music information retrieval, serve as the input features.
2. **Generative Adversarial Network (GAN)**: The generator produces audio waveforms from MFCC features, while the discriminator tries to distinguish real audio from generated audio. This adversarial training pushes the generator toward signals that are closer to the real ones.
3. **Experimental Results**: The authors compared the MFCC-GAN Codec with five widely used audio codecs (AAC, AC3, Opus, Vorbis, and Speex) at different bitrates. The results show that, even at lower bitrates, the MFCC-GAN Codec can match or exceed traditional codecs in terms of Signal-to-Noise Ratio (SNR), Non-Intrusive Speech Quality Assessment (NISQA-MOS), and other metrics.
4. **Future Work**: The authors suggest adopting loss functions that optimize intelligibility and perceptual quality metrics, in order to improve quality and intelligibility together.

In summary, by introducing a GAN-based learned decoder, the MFCC-GAN Codec can achieve high-quality audio reconstruction at lower bitrates, providing new insights for the development of audio coding technology.
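To make the encoder side more concrete, here is a minimal NumPy sketch of textbook MFCC extraction: framing, windowing, power spectrum, mel filterbank, log, and DCT-II. It is not the authors' implementation. The sample rate, frame length, hop size, FFT size, filter count, and coefficient count are all assumed values chosen for illustration, and the GAN decoder that would turn these features back into a waveform is not shown.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank


def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_coeffs=13):
    """Return an array of shape (n_frames, n_coeffs).

    All default parameter values are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.ndim != 1 or len(signal) < frame_len:
        raise ValueError("signal must be 1-D and at least frame_len samples long")

    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # Power spectrum -> mel band energies -> log compression.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_mel = np.log(mel_energy + 1e-10)

    # Unnormalized DCT-II over the mel axis; keep the first n_coeffs.
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return log_mel @ dct_basis.T


if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    features = mfcc(np.sin(2.0 * np.pi * 440.0 * t))
    print(features.shape)
```

Because the features discard phase and fine spectral detail, the decoder has to synthesize plausible waveform detail rather than invert the transform exactly, which is the role the summary assigns to the adversarially trained generator.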