Abstract: Although learning-based image and video coding techniques have developed rapidly, their signal-fidelity-driven objectives steer them away from a coding framework that is both effective and efficient for humans and machines alike. In this paper, we address this issue by using generative models to bridge the gap between full fidelity (for human vision) and high discrimination (for machine vision). Relying on existing pretrained generative adversarial networks (GANs), we build a GAN inversion framework that projects an image onto a low-dimensional natural image manifold. On this manifold, the feature, which we call the latent code, is highly discriminative and also encodes the appearance information of the image. By applying a variational bit-rate constraint with a hyperprior model to model and suppress the entropy of the image manifold code, our method can meet the needs of both machine and human vision at very low bit-rates. To improve the visual quality of image reconstruction, we further propose multiple latent codes and scalable inversion. The former obtains several latent codes during inversion, while the latter additionally compresses and transmits a shallow compact feature to support visual reconstruction. Experimental results demonstrate the superiority of our method in both human vision tasks, i.e., image reconstruction, and machine vision tasks, including semantic parsing and attribute prediction.
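
As a rough illustration of the bit-rate constraint mentioned in the abstract, the sketch below estimates how many bits a quantized latent code would cost under a per-element Gaussian entropy model. It is not the authors' implementation: the function names `gaussian_cdf` and `estimate_bits` are hypothetical, and the means and scales, which a hyperprior network would normally predict, are passed in as plain lists here.

```python
import math


def gaussian_cdf(x, mean, scale):
    """Cumulative distribution function of a Gaussian N(mean, scale^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def estimate_bits(latent, means, scales, eps=1e-9):
    """Estimated bits needed to code the rounded latent values.

    Assumption: each element follows an independent Gaussian whose mean and
    scale are given (a stand-in for the output of a hyperprior model).
    """
    total = 0.0
    for y, mu, sigma in zip(latent, means, scales):
        q = round(y)  # quantize to the nearest integer
        # Probability mass of the quantization bin [q - 0.5, q + 0.5].
        p = gaussian_cdf(q + 0.5, mu, sigma) - gaussian_cdf(q - 0.5, mu, sigma)
        total += -math.log2(max(p, eps))  # clamp to avoid log2(0)
    return total
```

In a setup of this kind, such a rate estimate would typically be added to a reconstruction loss, so that training trades off bit cost against fidelity. A latent value that lies close to its predicted mean with a small scale costs almost nothing, while a value far from the mean costs many bits.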

MS-Glance: Bio-Inspired Non-semantic Context Vectors and their Applications in Supervising Image Reconstruction

Alleviating the Semantic Gap for Generalized fMRI-to-Image Reconstruction

Rethinking Visual Reconstruction: Experience-Based Content Completion Guided by Visual Cues

Semantic Reconstruction based on RGB Image and Sparse Depth

Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation

Image-text Retrieval via Preserving Main Semantics of Vision

InvSeg: Test-Time Prompt Inversion for Semantic Segmentation

MambaIR: A Simple Baseline for Image Restoration with State-Space Model

GlanceSeg: Real-time microaneurysm lesion segmentation with gaze-map-guided foundation model for early detection of diabetic retinopathy

WE-GS: An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections

CLIP-GS: CLIP-Informed Gaussian Splatting for Real-time and View-consistent 3D Semantic Understanding

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

Cross-Modal Conditioned Reconstruction for Language-guided Medical Image Segmentation

Second Sight: Using brain-optimized encoding models to align image distributions with human brain activity

Facial Image Compression via Neural Image Manifold Compression

Context-Aware Optimal Transport Learning for Retinal Fundus Image Enhancement

GLMHA: A Guided Low-rank Multi-Head Self-Attention for Efficient Image Restoration and Spectral Reconstruction

SCGA‐Net: Skip Connections Global Attention Network for Image Restoration

Mind-bridge: Reconstructing Visual Images Based on Diffusion Model from Human Brain Activity

GL-Segnet: Global-Local representation learning net for medical image segmentation

SA-GS: Semantic-Aware Gaussian Splatting for Large Scene Reconstruction with Geometry Constrain