Abstract:Video compression is indispensable to most video analysis systems. Despite saving the transportation bandwidth, it also deteriorates downstream video understanding tasks, especially at low-bitrate settings. To systematically investigate this problem, we first thoroughly review the previous methods, revealing that three principles, i.e., task-decoupled, label-free, and data-emerged semantic prior, are critical to a machine-friendly coding framework but are not fully satisfied so far. In this paper, we propose a traditional-neural mixed coding framework that simultaneously fulfills all these principles, by taking advantage of both traditional codecs and neural networks (NNs). On one hand, the traditional codecs can efficiently encode the pixel signal of videos but may distort the semantic information. On the other hand, highly non-linear NNs are proficient in condensing video semantics into a compact representation. The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved w.r.t. the coding procedure, which is spontaneously learned from unlabeled data in a self-supervised manner. The videos collaboratively decoded from two streams (codec and NN) are of rich semantics, as well as visually photo-realistic, empirically boosting several mainstream downstream video analysis task performances without any post-adaptation procedure. Furthermore, by introducing the attention mechanism and adaptive modeling scheme, the video semantic modeling ability of our approach is further enhanced. Fianlly, we build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach. All codes, data, and models will be open-sourced for facilitating future research.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve This paper aims to address the impact of video compression on the performance of video understanding tasks at low bit rates. Specifically: 1. **Impact of Video Compression on Downstream Tasks**: Although video compression can save transmission bandwidth, it severely damages the performance of downstream video understanding tasks, especially at low bit rate settings. For example, even with the advanced VVC codec, the recognition accuracy of the SlowFast model on the Kinetics dataset drops by 27% at 0.02bpp (bits per pixel). 2. **Combination of Traditional Codecs and Neural Networks**: Traditional codecs can efficiently encode the pixel signals of videos but may distort semantic information. Highly nonlinear neural networks, on the other hand, excel at compressing video semantics into compact representations. Therefore, the paper proposes a hybrid traditional-neural codec framework that aims to leverage the advantages of both to preserve semantic information at low bit rates. 3. **Building a Unified Video Coding Framework**: The paper proposes a framework called VCS (Video Coding for Semantics), which spontaneously learns semantic representations from unlabeled data through self-supervised learning, achieving task decoupling, label-free, and data-spontaneous semantic priors. This enables VCS to support various machine intelligence tasks without the need for task-specific adaptive adjustments. ### Main Contributions 1. **VCS Framework**: VCS introduces a neural stream that adds data-spontaneous semantic priors on top of traditional codecs, fully leveraging the advantages of traditional codecs. VCS can be easily deployed to various downstream video analysis tasks without any task-specific or data-specific adaptive processing. 2. **Self-Supervised Optimization**: VCS is optimized through a bottleneck-based contrastive learning objective, which helps retain video semantics and encourages the discarding of irrelevant semantic information. 3. **Network Architecture Design**: The network architecture of VCS is carefully designed, adopting adaptive and dynamic schemes to enhance its semantic modeling capabilities. 4. **Extensive Task and Dataset Evaluation**: The paper evaluates VCS on three popular video understanding tasks (action recognition, multi-object tracking, and video object segmentation) using eight large-scale datasets, demonstrating VCS's strong performance across various tasks and datasets. To facilitate future research, the authors also constructed a systematic coding benchmark, including the re-implementation and evaluation of three traditional codecs, two learnable codecs, and four VCM methods. In summary, this paper addresses the impact of video compression on the performance of video understanding tasks at low bit rates by proposing the VCS framework and validates its effectiveness and superiority through a series of experiments.

A Coding Framework and Benchmark towards Low-Bitrate Video Understanding

A Coding Framework and Benchmark towards Low-Bitrate Video Understanding

Semantic Neural Rendering-based Video Coding: Towards Ultra-Low Bitrate Video Conferencing

A Neural-network Enhanced Video Coding Framework beyond ECM

VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision

Standard compliant video coding using low complexity, switchable neural wrappers

A Preprocessing Framework for Video Machine Vision under Compression

DeepCoder: A Deep Neural Network Based Video Compression

Beyond VVC: Towards Perceptual Quality Optimized Video Compression Using Multi-Scale Hybrid Approaches.

A Framework of Video Coding for Compressing Near-Duplicate Videos

A Compressive Prior Guided Mask Predictive Coding Approach for Video Analysis.

Task-Driven Semantic Coding via Reinforcement Learning

Video Coding for Machines: Compact Visual Representation Compression for Intelligent Collaborative Analytics

Effortless Cross-Platform Video Codec: A Codebook-Based Method

Semantic-Aware Visual Decomposition for Image Coding

Video Coding for Machines: A Paradigm of Collaborative Compression and Intelligent Analytics

Towards Next Generation Video Coding: from Neural Network Based Predictive Coding to In-Loop Filtering

Advances in Video Compression System Using Deep Neural Network: A Review and Case Studies

Neural Video Compression with Feature Modulation

Model-based Low Bit-Rate Video Coding for Resource-Deficient Wireless Visual Communication

Task-Driven Video Compression for Humans and Machines: Framework Design and Optimization