Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression

Yuan Tian, Guo Lu, Guangtao Zhai
2024-09-22
Abstract: Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited, due to the single semantic learning objective, limited training data, etc. To address this, we propose to boost the UVSC task by absorbing off-the-shelf rich semantics from visual foundation models (VFMs). Specifically, we introduce a VFM-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually enhanced semantic space, guiding the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory based on the historical content, and then traverses along the trajectory to predict the future semantics as the coding context. This reduces the overall bit cost of the system, further improving the compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the problem of **Unsupervised Video Semantic Compression (UVSC)**. Existing video compression methods perform well in terms of visual quality, but they perform poorly when applied directly to downstream analysis tasks, because they do not specifically preserve the semantic information of videos during compression. Supervised methods, in turn, require time-consuming training for each specific task and perform poorly on other tasks, which limits their practical use.

To solve these problems, the authors propose a new framework, **Free-VSC**, which absorbs rich semantic information from pre-trained Visual Foundation Models (VFMs) to achieve more effective unsupervised video semantic compression. The main contributions of this framework include:

1. **Utilizing multiple Visual Foundation Models (VFMs)**: This is the first attempt to use VFMs for semantic compression, reusing their rich semantic representations.
2. **Introducing the Prompt-based Semantic Alignment Layer (Prom-SAL)**: This enables the framework to learn mutually enhancing semantics from multiple VFMs and guides the compression model to specifically preserve the semantic information in videos.
3. **Proposing a trajectory-based entropy model**: By predicting the semantically adaptive trajectory of videos to remove inter-frame semantic redundancy, it provides better semantic compression efficiency than traditional methods.

Through these innovations, Free-VSC outperforms existing methods on three mainstream tasks and six datasets, improving the performance of compressed videos across analysis tasks.

### Formulas involved

1. **Calculation of the semantic distortion term**:
\[ D_{\text{sem}} = \ell_2(g^{\text{fine}}_n, V_n(X)^{\text{fine}}) + \ell_2(g^{\text{coar}}_n, V_n(X)^{\text{coar}}) \]
where $\ell_2$ represents the Mean Squared Error (MSE) loss.
2. **Bit-rate and distortion trade-off objective**:
\[ L_{\text{RD-sem}} = QP \cdot \frac{1}{T}\sum_{t=1}^{T} R(\hat{f}_t) + D_{\text{sem}} \]
where $QP$ is the compression quality parameter, $R(\hat{f}_t)$ is the entropy of $\hat{f}_t$, and $D_{\text{sem}}$ is the semantic distortion term.
3. **Overall loss function**:
\[ L = L_{\text{RD-sem}} + L_{\text{percep}} + L_{\text{GAN}} \]
where $L_{\text{percep}}$ is the perceptual loss and $L_{\text{GAN}}$ is the Generative Adversarial Network (GAN) loss.

Together, these terms train Free-VSC to compress efficiently while preserving video semantics.
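To make the arithmetic of the three formulas concrete, here is a minimal NumPy sketch that evaluates them on toy numbers. It is not the authors' implementation: the function names (`mse`, `semantic_distortion`, `rd_sem_loss`, `total_loss`) are hypothetical, the feature arrays stand in for the fine and coarse features of a single VFM, the per-frame rates $R(\hat{f}_t)$ are supplied as plain numbers rather than computed by an entropy model, and the perceptual and GAN losses are placeholder scalars.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two equally shaped arrays (the l2 term)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))


def semantic_distortion(g_fine, v_fine, g_coar, v_coar):
    """D_sem: fine-level MSE plus coarse-level MSE for one VFM."""
    return mse(g_fine, v_fine) + mse(g_coar, v_coar)


def rd_sem_loss(qp, frame_rates, d_sem):
    """L_RD-sem = QP * (1/T) * sum_t R(f_t) + D_sem.

    frame_rates holds one rate value per frame; here they are given
    directly (assumption) instead of coming from an entropy model.
    """
    return qp * float(np.mean(frame_rates)) + d_sem


def total_loss(l_rd_sem, l_percep, l_gan):
    """L = L_RD-sem + L_percep + L_GAN."""
    return l_rd_sem + l_percep + l_gan


# Toy inputs (illustrative values only).
g_fine, v_fine = np.zeros((2, 2)), np.ones((2, 2))       # fine MSE = 1.0
g_coar, v_coar = np.zeros(2), np.full(2, 2.0)            # coarse MSE = 4.0

d_sem = semantic_distortion(g_fine, v_fine, g_coar, v_coar)   # 5.0
l_rd = rd_sem_loss(qp=0.5, frame_rates=[1.0, 3.0], d_sem=d_sem)  # 0.5 * 2.0 + 5.0 = 6.0
loss = total_loss(l_rd, l_percep=0.25, l_gan=0.75)            # 7.0
print(d_sem, l_rd, loss)
```

In this form, $QP$ weights the average rate against the semantic distortion, so changing `qp` shifts the balance between bit cost and semantic fidelity.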