Abstract: Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited, owing to factors such as a single semantic learning objective and limited training data. To address this, we propose to boost the UVSC task by absorbing the off-the-shelf rich semantics of visual foundation models (VFMs). Specifically, we introduce a VFM-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually enhanced semantic space that guides the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory from the historical content and then traverses along the trajectory to predict the future semantics as the coding context. This reduces the overall bit cost of the system, further improving compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.

### What problem does this paper attempt to address?

This paper addresses the problem of **Unsupervised Video Semantic Compression (UVSC)**. Although existing video compression methods perform well in terms of visual quality, they perform poorly when applied directly to downstream analysis tasks, because they do not specifically preserve the semantic information of videos during compression. In addition, supervised methods require time-consuming training for each specific task and perform poorly on other tasks, which limits their practical application.

To solve these problems, the authors propose a new framework, **Free-VSC**, which absorbs rich semantic information from pre-trained Visual Foundation Models (VFMs) to achieve more effective unsupervised video semantic compression. The main contributions of this framework include:

1. **Utilizing multiple Visual Foundation Models (VFMs)**: This is the first attempt to use VFMs for semantic compression, reusing their rich semantic representations.
2. **Introducing the Prompt-based Semantic Alignment Layer (Prom-SAL)**: This enables the framework to learn mutually enhancing semantics from multiple VFMs and guides the compression model to specifically preserve the semantic information in videos.
3. **Proposing a trajectory-based entropy model**: By predicting the semantically adaptive trajectory of a video to remove inter-frame semantic redundancy, it provides better semantic compression efficiency than traditional methods.

Through these innovations, Free-VSC outperforms existing methods on three mainstream tasks and six datasets, improving the performance of compressed videos in various analysis tasks.

### Formulas involved

1. **Calculation of the semantic distortion term**:
   \[ D_{\text{sem}} = \ell_2(g^{\text{fine}}_n, V_n(X)^{\text{fine}}) + \ell_2(g^{\text{coar}}_n, V_n(X)^{\text{coar}}) \]
   where $\ell_2$ represents the Mean Squared Error (MSE) loss.
2. **Bit-rate and distortion trade-off objective**:
   \[ L_{\text{RD-sem}} = QP \cdot \frac{1}{T} \sum_{t=1}^{T} R(\hat{f}_t) + D_{\text{sem}} \]
   where $QP$ is the compression quality parameter, $R(\hat{f}_t)$ is the entropy of $\hat{f}_t$, and $D_{\text{sem}}$ is the semantic distortion term.
3. **Overall loss function**:
   \[ L = L_{\text{RD-sem}} + L_{\text{percep}} + L_{\text{GAN}} \]
   where $L_{\text{percep}}$ is the perceptual loss and $L_{\text{GAN}}$ is the Generative Adversarial Network (GAN) loss.

Together, these objectives train Free-VSC to compress efficiently while preserving video semantics.
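The three objectives listed above can be illustrated with a small numerical sketch. It is not the authors' implementation: the function names, array shapes, and toy values are assumptions made for illustration. It treats the per-frame rates $R(\hat{f}_t)$, the perceptual loss, and the GAN loss as given numbers rather than computing them, and it evaluates $D_{\text{sem}}$ for a single VFM index $n$, since the summary does not state how the terms are combined across VFMs.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two equally shaped arrays (the l2 term)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))


def semantic_distortion(g_fine, v_fine, g_coar, v_coar):
    """D_sem for one VFM: fine-grained MSE plus coarse-grained MSE."""
    return mse(g_fine, v_fine) + mse(g_coar, v_coar)


def rd_sem_loss(qp, frame_rates, d_sem):
    """L_RD-sem = QP * (1/T) * sum_t R(f_t) + D_sem."""
    return qp * float(np.mean(frame_rates)) + d_sem


def total_loss(l_rd_sem, l_percep, l_gan):
    """L = L_RD-sem + L_percep + L_GAN."""
    return l_rd_sem + l_percep + l_gan


# Toy inputs (hypothetical shapes and values, chosen only for the example).
g_fine = np.zeros((2, 3))        # aligned fine-grained features of the compressed video
v_fine = np.ones((2, 3))         # fine-grained features V_n(X) of the original video
g_coar = np.array([1.0, 2.0])    # aligned coarse-grained features
v_coar = np.array([1.0, 4.0])    # coarse-grained features V_n(X)
frame_rates = [0.25, 0.5, 0.75]  # assumed rate estimates R(f_t) for T = 3 frames

d_sem = semantic_distortion(g_fine, v_fine, g_coar, v_coar)
l_rd_sem = rd_sem_loss(qp=2.0, frame_rates=frame_rates, d_sem=d_sem)
loss = total_loss(l_rd_sem, l_percep=0.5, l_gan=0.25)
print(d_sem, l_rd_sem, loss)
```

In this formulation $QP$ scales the rate term, so a larger $QP$ penalizes bits more heavily relative to the semantic distortion.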

Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression
