Abstract: Motivated by the efficiency investigation of the Transformer-based transform coding framework, namely SwinT-ChARM, we propose to enhance the latter, first, with a more straightforward yet effective Transformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT). Current methods that still rely on ConvNet-based entropy coding are limited in modeling long-range dependencies due to their local connectivity and an increasing number of architectural biases and priors. On the contrary, the proposed ICT can capture both global and local contexts from the latent representations and better parameterize the distribution of the quantized latents. Further, we leverage a learnable scaling module with a sandwich ConvNeXt-based pre/post-processor to accurately extract a more compact latent representation while reconstructing higher-quality images. Extensive experimental results on benchmark datasets show that the proposed adaptive image compression transformer (AICT) framework significantly improves the trade-off between coding efficiency and decoder complexity over the versatile video coding (VVC) reference encoder (VTM-18.0) and the neural codec SwinT-ChARM.
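
The "sandwich" idea mentioned in the abstract (rescale, code, rescale back) can be illustrated with a toy one-dimensional sketch. Everything here is an assumption made for illustration: `choose_factor` is a hypothetical hand-written stand-in for the learnable scaling module, `toy_codec` is plain uniform quantization standing in for the neural codec, and average pooling / sample repetition stand in for the ConvNeXt-based pre/post-processors. It is not the authors' implementation.

```python
def downscale(signal, factor):
    # Pre-processor stand-in: average pooling over non-overlapping windows.
    windows = [signal[i:i + factor] for i in range(0, len(signal), factor)]
    return [sum(w) / len(w) for w in windows]


def upscale(signal, factor):
    # Post-processor stand-in: repeat every sample `factor` times.
    return [v for v in signal for _ in range(factor)]


def toy_codec(signal, step=1.0):
    # Codec stand-in: uniform quantization with the given step size.
    return [round(v / step) * step for v in signal]


def choose_factor(signal, threshold=1.0):
    # Hypothetical content-dependent estimator: smooth content is coded at
    # half resolution, busy content keeps its full resolution.
    diffs = [abs(b - a) for a, b in zip(signal, signal[1:])]
    activity = sum(diffs) / max(len(diffs), 1)
    return 2 if activity < threshold else 1


def sandwich(signal):
    # Downscale, code, upscale; also report how many samples were coded.
    factor = choose_factor(signal)
    latent = toy_codec(downscale(signal, factor))
    return upscale(latent, factor)[:len(signal)], len(latent)


smooth_recon, smooth_coded = sandwich([2.0, 2.0, 4.0, 4.0, 6.0, 6.0])
busy_recon, busy_coded = sandwich([0.0, 5.0, 0.0, 5.0])
```

In this toy setting the smooth signal is coded with half as many samples as the input, while the busy one keeps all of them, which is the kind of content-adaptive trade-off the scaling module is meant to learn.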

What problem does this paper attempt to address?

The main problem this paper addresses is improving image compression efficiency while reducing decoder complexity. Specifically, the paper proposes an Adaptive Image Compression Transformer (AICT) to improve on existing entropy-coding methods based on convolutional networks (ConvNets), which are limited in modeling long-range dependencies because of their local connectivity. AICT extracts more compact latent representations and reconstructs higher-quality images by introducing a more effective Transformer-based channel-wise autoregressive prior model, together with a learnable scaling module and ConvNeXt-based pre/post-processors. Experimental results show that the AICT framework significantly outperforms the Versatile Video Coding (VVC) reference encoder (VTM-18.0) and the neural codec SwinT-ChARM on multiple benchmark datasets, especially in the trade-off between coding efficiency and decoder complexity.

The key innovation points in the paper include:

- **Proposing a new Image Compression Transformer (ICT)**: This nonlinear transform coding and spatial-channel autoregressive entropy coding module, based on Swin Transformer blocks, effectively reduces the correlation among latent variables and has a more flexible receptive field that adapts to contexts requiring short- or long-range information.
- **Introducing the Adaptive Image Compression Transformer (AICT) model**: A scale-adaptation module is used as a sandwich processor to enhance compression efficiency. This module consists of a neural scaling network and ConvNeXt-based pre/post-processors, jointly optimized with differentiable resizing layers and a content-dependent resize-factor estimator.
- **Conducting extensive experimental verification**: Experiments were carried out on four widely used benchmark datasets to explore possible sources of coding gain and to demonstrate the effectiveness of AICT. In addition, a model scaling analysis and ablation studies were carried out to justify the architectural decisions.

These contributions enable AICT to achieve higher compression efficiency than existing methods while maintaining a lower decoding time, thus potentially helping with high-efficiency real-time visual data compression.
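
To make the channel-wise autoregressive prior more concrete, the following minimal sketch estimates the bit cost of latent channel slices when each slice is conditioned on the slices decoded before it. It rests on several assumptions: `predict_mean` is a hypothetical stand-in (a simple average) for the learned Transformer-based parameter network, the scale `sigma` is fixed rather than predicted, and the toy latents are invented for illustration. It shows the conditioning structure only, not the paper's model.

```python
import math


def gaussian_cdf(x, sigma):
    # CDF of a zero-mean Gaussian with standard deviation sigma.
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def symbol_bits(residual, sigma):
    # Bits for an integer residual: probability mass of the unit-width bin.
    p = gaussian_cdf(residual + 0.5, sigma) - gaussian_cdf(residual - 0.5, sigma)
    return -math.log2(max(p, 1e-12))


def predict_mean(decoded_slices, length):
    # Hypothetical stand-in for the learned parameter network: the
    # element-wise average of all previously decoded slices.
    if not decoded_slices:
        return [0.0] * length
    n = len(decoded_slices)
    return [sum(s[k] for s in decoded_slices) / n for k in range(length)]


def code_slices(slices, sigma=1.0):
    # Code slices in order; each slice only sees already decoded slices,
    # so a decoder can reproduce the same predictions.
    decoded, total_bits = [], 0.0
    for y in slices:
        mu = predict_mean(decoded, len(y))
        residuals = [round(v - m) for v, m in zip(y, mu)]
        total_bits += sum(symbol_bits(r, sigma) for r in residuals)
        decoded.append([r + m for r, m in zip(residuals, mu)])
    return decoded, total_bits


slices = [[4.0, -3.0], [4.0, -3.0]]  # two strongly correlated toy slices
decoded, conditioned_bits = code_slices(slices)
independent_bits = sum(code_slices([s])[1] for s in slices)
```

Because the second toy slice is predicted well from the first, `conditioned_bits` comes out lower than `independent_bits`; this is the redundancy that a channel-wise autoregressive prior is designed to exploit.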
