A Coding Framework and Benchmark towards Low-Bitrate Video Understanding

Yuan Tian, Guo Lu, Yichao Yan, Guangtao Zhai, Li Chen, Zhiyong Gao
DOI: https://doi.org/10.1109/tpami.2024.3367879
IF: 23.6
2024-01-01
IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract: Video compression is indispensable to most video analysis systems. Despite saving the transportation bandwidth, it also deteriorates downstream video understanding tasks, especially at low-bitrate settings. To systematically investigate this problem, we first thoroughly review the previous methods, revealing that three principles, i.e., task-decoupled, label-free, and data-emerged semantic prior, are critical to a machine-friendly coding framework but are not fully satisfied so far. In this paper, we propose a traditional-neural mixed coding framework that simultaneously fulfills all these principles, by taking advantage of both traditional codecs and neural networks (NNs). On one hand, the traditional codecs can efficiently encode the pixel signal of videos but may distort the semantic information. On the other hand, highly non-linear NNs are proficient in condensing video semantics into a compact representation. The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved w.r.t. the coding procedure, which is spontaneously learned from unlabeled data in a self-supervised manner. The videos collaboratively decoded from two streams (codec and NN) are of rich semantics, as well as visually photo-realistic, empirically boosting several mainstream downstream video analysis task performances without any post-adaptation procedure. Furthermore, by introducing the attention mechanism and adaptive modeling scheme, the video semantic modeling ability of our approach is further enhanced. Finally, we build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach. All codes, data, and models will be open-sourced for facilitating future research.
computer science, artificial intelligence; engineering, electrical & electronic
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the impact of video compression on the performance of video understanding tasks at low bitrates. Specifically:

1. **Impact of Video Compression on Downstream Tasks**: Although video compression saves transmission bandwidth, it severely damages the performance of downstream video understanding tasks, especially at low-bitrate settings. For example, even with the advanced VVC codec, the recognition accuracy of the SlowFast model on the Kinetics dataset drops by 27% at 0.02 bpp (bits per pixel).
2. **Combination of Traditional Codecs and Neural Networks**: Traditional codecs can efficiently encode the pixel signals of videos but may distort semantic information. Highly non-linear neural networks, on the other hand, excel at condensing video semantics into compact representations. The paper therefore proposes a hybrid traditional-neural coding framework that leverages the strengths of both to preserve semantic information at low bitrates.
3. **Building a Unified Video Coding Framework**: The paper proposes a framework called VCS (Video Coding for Semantics), which spontaneously learns semantic representations from unlabeled data through self-supervised learning and thereby satisfies the three principles: task-decoupled, label-free, and data-emerged semantic priors. This enables VCS to support various machine intelligence tasks without task-specific adaptation.

### Main Contributions

1. **VCS Framework**: VCS introduces a neural stream that adds data-emerged semantic priors on top of traditional codecs, so the efficient pixel coding of the traditional codec is retained. VCS can be deployed to various downstream video analysis tasks without any task-specific or data-specific adaptation.
2. **Self-Supervised Optimization**: VCS is optimized through a bottleneck-based contrastive learning objective, which helps retain video semantics and encourages the discarding of irrelevant semantic information.
3. **Network Architecture Design**: The network architecture of VCS is carefully designed, adopting adaptive and dynamic schemes to enhance its semantic modeling capabilities.
4. **Extensive Task and Dataset Evaluation**: The paper evaluates VCS on three popular video understanding tasks (action recognition, multi-object tracking, and video object segmentation) using eight large-scale datasets, demonstrating strong performance across tasks and datasets. To facilitate future research, the authors also constructed a systematic coding benchmark, including the re-implementation and evaluation of three traditional codecs, two learnable codecs, and four VCM methods.

In summary, the paper tackles the degradation of video understanding performance caused by low-bitrate compression by proposing the VCS framework, and validates its effectiveness through a series of experiments.
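
To put the 0.02 bpp operating point cited above into more familiar units, the following is a minimal sketch of the standard bits-per-pixel arithmetic. The helper names `bpp_to_kbps` and `bpp_from_size`, and the 224x224, 30 fps clip used in the example, are illustrative assumptions and are not taken from the paper.

```python
def bpp_to_kbps(bpp: float, width: int, height: int, fps: float) -> float:
    """Convert bits per pixel to an average bitrate in kilobits per second."""
    bits_per_second = bpp * width * height * fps
    return bits_per_second / 1000.0


def bpp_from_size(total_bytes: int, width: int, height: int, num_frames: int) -> float:
    """Compute bits per pixel from the size of a compressed clip."""
    total_bits = total_bytes * 8
    return total_bits / (width * height * num_frames)


# Assumed example: a 224x224 clip at 30 fps (hypothetical values).
kbps = bpp_to_kbps(0.02, 224, 224, 30)            # about 30.1 kbps
# A 10-second clip (300 frames) of that size stored in 37,632 bytes.
bpp = bpp_from_size(37632, 224, 224, 300)         # about 0.02 bpp
```

Under these assumed dimensions, 0.02 bpp corresponds to roughly 30 kbps, which gives a sense of how aggressive the low-bitrate setting is.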