Abstract: In this paper, we propose AI-based audio coding using MFCC features in an adversarial setting. We combine a conventional encoder with an adversarial learning decoder to better reconstruct the original waveform. Because GANs perform implicit density estimation, such models are less prone to overfitting. We compare our work with five well-known codecs, namely AAC, AC3, Opus, Vorbis, and Speex, operating at bitrates from 2 kbps to 128 kbps. MFCCGAN_36k achieved the state-of-the-art result in terms of SNR despite a lower bitrate than AC3_128k, AAC_112k, Vorbis_48k, Opus_48k, and Speex_48K. MFCCGAN_13k also achieved a high SNR of 27, equal to that of AC3_128k and AAC_112k, while having a significantly lower bitrate (13 kbps). MFCCGAN_36k achieved higher NISQA-MOS results than AAC_48k while having a 20% lower bitrate. Furthermore, MFCCGAN_13k obtained a NISQA-MOS of 3.9, which is much higher than that of AAC_24k, AAC_32k, AC3_32k, and AAC_48k. For future work, we suggest adopting loss functions that optimize intelligibility and perceptual metrics in the MFCCGAN structure to improve quality and intelligibility simultaneously.

What problem does this paper attempt to address?

This paper proposes an audio coding method based on Mel-Frequency Cepstral Coefficient (MFCC) features and Generative Adversarial Networks (GANs), called the MFCC-GAN Codec. The authors aim to reduce the reconstruction distortion of traditional audio coding, especially at low bitrates. By combining a conventional encoder with a GAN-based learned decoder, the method reconstructs the original audio waveform from the extracted MFCC features. The main points are:

1. **Feature Extraction**: MFCCs, which are commonly employed in speech recognition and music information retrieval, are used as the input features.
2. **Generative Adversarial Network (GAN)**: The generator produces audio waveforms from MFCC features, while the discriminator distinguishes real audio from generated audio. Training against the discriminator pushes the generator toward audio that is closer to the real signal.
3. **Experimental Results**: The authors compared the MFCC-GAN Codec with five widely used audio codecs (AAC, AC3, Opus, Vorbis, and Speex) at different bitrates. The results show that, even at lower bitrates, the MFCC-GAN Codec can match or exceed traditional codecs in terms of Signal-to-Noise Ratio (SNR) and the NISQA-MOS score (Non-Intrusive Speech Quality Assessment), among other metrics.
4. **Future Work**: The authors suggest adopting loss functions that optimize intelligibility and perceptual quality metrics, with the goal of improving both at once.

In summary, by introducing a GAN-based learned decoder, the MFCC-GAN Codec can achieve high-quality audio reconstruction at lower bitrates, offering a new direction for audio coding.
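The SNR figures quoted above compare a decoded waveform against the original. The text here does not state the exact SNR definition used in the paper, so the following is only a minimal sketch of the common waveform-domain definition, 10·log10(signal power / error power). The function name `snr_db` and the toy signals are hypothetical and chosen for illustration.

```python
import math


def snr_db(reference, decoded):
    """Waveform SNR in dB between a reference signal and its decoded version.

    Assumption: this is the textbook definition, not necessarily the exact
    variant used in the paper's evaluation.
    """
    if len(reference) != len(decoded):
        raise ValueError("signals must have the same number of samples")
    # Energy of the original signal.
    signal_power = sum(x * x for x in reference)
    # Energy of the reconstruction error.
    noise_power = sum((x - y) ** 2 for x, y in zip(reference, decoded))
    if noise_power == 0:
        return math.inf  # perfect reconstruction
    if signal_power == 0:
        return -math.inf  # silent reference with a non-silent error
    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    # Toy example: the "decoder" output is the reference scaled by 0.9,
    # so the error amplitude is 10% of the signal amplitude.
    reference = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
    decoded = [0.9 * x for x in reference]
    print(round(snr_db(reference, decoded), 2))  # prints 20.0
```

An error amplitude of 10% corresponds to a power ratio of 100, that is, 20 dB. Waveform SNR penalizes any sample-level deviation, including phase shifts that may be inaudible, which is one reason a perceptual score such as NISQA-MOS is reported alongside it.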

MFCC-GAN Codec: A New AI-based Audio Coding

MFCCGAN: A Novel MFCC-Based Speech Synthesizer Using Adversarial Learning

A Generative Adversarial Net-Based Bandwidth Extension Method for Audio Compression

A High Fidelity and Low Complexity Neural Audio Coding

MusicHiFi: Fast High-Fidelity Stereo Vocoding

VQCPC-GAN: Variable-Length Adversarial Audio Synthesis Using Vector-Quantized Contrastive Predictive Coding

A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate

MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios

Gull: A Generative Multifunctional Audio Codec

High-Fidelity Audio Compression with Improved RVQGAN

Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder

Analysis by Adversarial Synthesis -- A Novel Approach for Speech Vocoding

APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm

GAN-based Image Compression with Improved RDO Process

FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates

Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis

ScoreDec: A Phase-preserving High-Fidelity Audio Codec with A Generalized Score-based Diffusion Post-filter

Psychoacoustic Calibration of Loss Functions for Efficient End-to-End Neural Audio Coding

AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec

CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement