Compressed Vision for Efficient Video Understanding

Olivia Wiles, Joao Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski
DOI: https://doi.org/10.48550/arXiv.2210.02995
2022-10-06
Abstract: Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos requires more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels -- data transfer, speed and memory -- making it possible to train models faster and on much longer videos. Processing compressed signals has, however, the downside of precluding standard augmentation techniques if done naively. We address that by introducing a small network that can apply transformations to latent codes corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using a compressed representation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **how to efficiently process and understand long videos (such as videos lasting several hours) without expensive hardware resources**. Traditional computer vision research mainly focuses on short videos (from a few seconds to dozens of seconds), because longer videos demand more compute and storage. This makes training models on them very time-consuming and difficult on existing hardware.

To solve this problem, the authors propose a framework named "Compressed Vision", which improves efficiency in the following ways:

1. **Neural Compression**: Use a neural network to compress videos instead of traditional compression methods such as JPEG or MPEG. This can significantly reduce the amount of data while maintaining relatively high video quality.
2. **Directly Use the Compressed Video as Input**: Feed the compressed video directly into a standard video-understanding network, avoiding the decompression step of traditional pipelines and thereby improving data transfer, speed, and memory usage.
3. **Augmentations in Compressed Space**: Introduce a small neural network that performs common data-augmentation operations (such as cropping and flipping) in the compressed feature space. This addresses the fact that standard augmentations cannot be applied naively to compressed inputs, while retaining the efficiency advantages of compression.

With these components, the authors show that their method can train video models more efficiently on popular benchmark datasets (such as Kinetics600 and COIN) and can handle videos lasting several hours, which is difficult to achieve with traditional methods.
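To make the idea of augmenting in compressed space concrete, here is a minimal sketch using a toy codec. Everything in it is an assumption for illustration: `encode` and `decode` are hand-written stand-ins (pair averaging and value repetition), not the paper's neural compressor, and `flip_latent` is a fixed function rather than the learned augmentation network. For this toy codec, flipping the latent grid and then decoding gives the same result as decoding and then flipping in pixel space.

```python
from typing import List

Frame = List[List[float]]  # one grayscale frame: rows of pixel values


def encode(frame: Frame) -> Frame:
    """Toy stand-in for an encoder: average each horizontal pixel pair.

    Assumes every row has an even number of pixels.
    """
    return [[(row[i] + row[i + 1]) / 2 for i in range(0, len(row), 2)]
            for row in frame]


def decode(latent: Frame) -> Frame:
    """Toy stand-in for a decoder: repeat every latent value twice."""
    return [[value for value in row for _ in range(2)] for row in latent]


def flip_pixels(frame: Frame) -> Frame:
    """Horizontal flip applied in pixel space."""
    return [row[::-1] for row in frame]


def flip_latent(latent: Frame) -> Frame:
    """The same flip applied directly to the latent grid (hypothetical helper)."""
    return [row[::-1] for row in latent]


frame = [[1.0, 3.0, 5.0, 7.0],
         [2.0, 2.0, 8.0, 6.0]]

latent = encode(frame)                    # half as many values as the frame
via_latent = decode(flip_latent(latent))  # augment the code, then decode
via_pixels = flip_pixels(decode(latent))  # decode, then augment the pixels

# For this toy codec the two routes agree exactly.
assert via_latent == via_pixels
```

A learned codec generally has no such simple correspondence between pixel-space and latent-space operations, which is presumably why the summary describes a small trained network for this step instead of a fixed rule.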
### Formula Summary

- **Compression Rate Formula**:

  \[ c_r=\frac{I_T\times I_H\times I_W\times3\times\log_2(256)}{T_T\times T_H\times T_W\times T_C\times\log_2(K)} \]

  where \(I_T, I_H, I_W\) represent the time, height, and width dimensions of the original video respectively; \(T_T, T_H, T_W\) represent the time, height, and width dimensions of the compressed tensor respectively; \(T_C\) represents the number of codebooks; \(K\) represents the number of codes in the codebook. The factor \(3\times\log_2(256)\) is the 3 color channels at 8 bits each, so the numerator is the size of the raw video in bits and the denominator is the size of the compressed tensor in bits.

- **Reconstruction Loss Function**:

  \[ \mathcal{L}=\|A(X)-c^{-1}(a(c(X), bb))\|_1 \]

  where \(A(X)\) represents the result of performing a certain augmentation operation on the original video \(X\); \(c(X)\) represents encoding the video \(X\); \(a(\cdot, bb)\) represents the augmentation network, which takes the latent code together with an additional argument \(bb\) describing the augmentation to apply; \(c^{-1}(\cdot)\) represents the decoder.

### Summary

The main contribution of this paper is a new framework that can efficiently process long videos on existing hardware, significantly improving the efficiency and feasibility of video-understanding tasks.
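The two formulas in the Formula Summary can be checked numerically with a short sketch. The function names and the example shapes below (a 32-frame 256x256 clip, a 16x16x16 latent with one codebook of 256 codes) are hypothetical values chosen for illustration and are not taken from the paper. The `l1_loss` helper only computes the norm itself on flat lists; it does not model the encoder, decoder, or augmentation network.

```python
import math
from typing import Sequence


def compression_rate(i_t: int, i_h: int, i_w: int,
                     t_t: int, t_h: int, t_w: int,
                     t_c: int, k: int) -> float:
    """Ratio of raw video bits to compressed tensor bits (c_r in the text)."""
    # Raw video: 3 color channels, log2(256) = 8 bits per channel value.
    original_bits = i_t * i_h * i_w * 3 * math.log2(256)
    # Compressed tensor: each entry indexes one of k codes, so log2(k) bits.
    compressed_bits = t_t * t_h * t_w * t_c * math.log2(k)
    return original_bits / compressed_bits


def l1_loss(target: Sequence[float], prediction: Sequence[float]) -> float:
    """L1 norm of the difference between two equally sized flat sequences."""
    if len(target) != len(prediction):
        raise ValueError("target and prediction must have the same length")
    return sum(abs(t - p) for t, p in zip(target, prediction))


# Hypothetical shapes: 32 frames of 256x256 RGB video compressed to a
# 16x16x16 latent with a single codebook of 256 codes.
rate = compression_rate(32, 256, 256, 16, 16, 16, t_c=1, k=256)
print(rate)  # 1536.0

# Toy values standing in for A(X) and the decoded, augmented reconstruction.
loss = l1_loss([1.0, 2.0, 3.0], [1.5, 2.0, 1.0])
print(loss)  # 2.5
```

Because the denominator grows only with \(\log_2(K)\), enlarging the codebook changes the rate far less than changing the latent grid size does: moving from 256 to 1024 codes raises the bits per entry from 8 to 10.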