C3: High-performance and low-complexity neural compression from a single image or video

Hyunjik Kim, Matthias Bauer, Lucas Theis, Jonathan Richard Schwarz, Emilien Dupont
2023-12-05
Abstract: Most neural compression models are trained on large datasets of images or videos in order to generalize to unseen data. Such generalization typically requires large and expressive architectures with a high decoding complexity. Here we introduce C3, a neural compression method with strong rate-distortion (RD) performance that instead overfits a small model to each image or video separately. The resulting decoding complexity of C3 can be an order of magnitude lower than neural baselines with similar RD performance. C3 builds on COOL-CHIC (Ladune et al.) and makes several simple and effective improvements for images. We further develop new methodology to apply C3 to videos. On the CLIC2020 image benchmark, we match the RD performance of VTM, the reference implementation of the H.266 codec, with less than 3k MACs/pixel for decoding. On the UVG video benchmark, we match the RD performance of the Video Compression Transformer (Mentzer et al.), a well-established neural video codec, with less than 5k MACs/pixel for decoding.
Image and Video Processing, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper aims to improve the rate-distortion (RD) performance of neural compression while keeping decoding complexity low. Most existing neural compression models are trained on large datasets so that they generalize to unseen data, which usually requires large, computationally expensive architectures. The resulting decoding complexity limits their use on resource-constrained hardware such as mobile devices, so designing codecs that combine strong RD performance with low decoding complexity remains a major challenge in neural compression.

The method introduced in the paper, C3, addresses this by overfitting a small model to each individual image or video. As a result, the decoding complexity of C3 can be an order of magnitude lower than that of neural baselines with similar RD performance. On the CLIC2020 image benchmark, C3 matches the RD performance of VTM with less than 3k MACs/pixel for decoding. On the UVG video benchmark, it matches the RD performance of the Video Compression Transformer (VCT) with less than 5k MACs/pixel for decoding.

### Main contributions of the paper:

1. **Low decoding complexity**: C3 reduces decoding complexity by overfitting a small model to each individual image or video, rather than relying on a large model that generalizes across a dataset.
2. **Strong RD performance**: On the CLIC2020 image benchmark, the RD performance of C3 matches that of VTM, and on the UVG video benchmark it matches that of VCT.
3. **Method innovation**: C3 makes several simple and effective improvements to COOL-CHIC, covering optimization, quantization, and architecture, which together improve the performance of the model.
4. **Video compression extension**: C3 applies not only to images but also to videos, using video-specific methods such as 3D parameters and operations and custom masks.

### Key technical points:

- **Soft-rounding**: Use a soft-rounding function during optimization to better approximate quantization.
- **Kumaraswamy noise**: Use samples from the Kumaraswamy distribution instead of uniform noise to model quantization error more flexibly.
- **Cosine decay schedule**: Use a cosine decay schedule for the learning rate in the first stage of optimization.
- **Smaller quantization step size**: Use a smaller quantization step size to avoid the instability or sub-optimal optimization caused by overly large input values.
- **Conditional entropy model**: Allow the context to include values from the previous grid, and use FiLM layers to make the network resolution-dependent.
- **GELU activation function**: Replace ReLU with GELU to improve the model's expressiveness.
- **Adaptive learning rate**: Adaptively reduce the learning rate in the second stage of optimization to further improve the RD loss.

Together, these improvements raise the RD performance of C3 while keeping its decoding complexity low, which points to a new direction for research in neural compression.
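
Three of the key technical points (soft-rounding, Kumaraswamy noise, and the cosine decay schedule) can be illustrated with a minimal plain-Python sketch. This is not the authors' implementation. The function names, the temperature parameterization of soft-rounding, the shift of the noise to [-0.5, 0.5], and the way the pieces are combined in the demo are all assumptions made for illustration; the summary above does not specify them. The soft-rounding formula is the commonly used tanh-based form, and the noise is drawn with the inverse CDF of the Kumaraswamy distribution, F(x) = 1 - (1 - x^a)^b.

```python
import math
import random


def soft_round(x: float, temperature: float) -> float:
    """Smooth approximation to rounding (hypothetical helper).

    Approaches hard rounding as temperature -> 0 and the identity
    function as temperature grows large.
    """
    floor_x = math.floor(x)
    delta = x - floor_x - 0.5  # offset from the interval midpoint, in [-0.5, 0.5)
    scale = math.tanh(0.5 / temperature)
    return floor_x + 0.5 * math.tanh(delta / temperature) / scale + 0.5


def kumaraswamy_noise(a: float, b: float, rng: random.Random) -> float:
    """Draw one Kumaraswamy(a, b) sample, shifted from [0, 1] to [-0.5, 0.5].

    With a = b = 1 this reduces to uniform noise. The shift is an
    assumption so the noise is centred like rounding error.
    """
    u = rng.random()
    # Inverse CDF of F(x) = 1 - (1 - x**a)**b.
    sample = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
    return sample - 0.5


def cosine_decay_lr(step: int, total_steps: int,
                    lr_start: float, lr_end: float = 0.0) -> float:
    """Cosine decay from lr_start (step 0) to lr_end (step total_steps)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    rng = random.Random(0)
    latent = 1.7  # illustrative latent value
    for temperature in (2.0, 0.3, 0.01):
        rounded = soft_round(latent, temperature)
        # Assumed combination for the demo: soft-rounded value plus noise.
        noisy = rounded + kumaraswamy_noise(2.0, 2.0, rng)
        print(f"T={temperature}: soft_round={rounded:.4f}, noisy={noisy:.4f}")
    for step in (0, 500, 1000):
        print(f"step={step}: lr={cosine_decay_lr(step, 1000, 1e-2):.6f}")
```

Lowering the temperature moves `soft_round` from a nearly linear map toward a hard staircase, which is why such a function is useful as a differentiable stand-in for quantization during optimization.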