Text-Conditioned Resampler For Long Form Video Understanding

Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari
2024-08-19
Abstract: In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time with plain attention and without optimised implementations. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we identify tasks that could benefit from longer video perception; and (iii) we empirically validate its efficacy on a wide variety of evaluation tasks including NextQA, EgoSchema, and the EGO4D-LTA challenge.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The core problem this paper addresses is how to process and understand long video sequences effectively, especially when the video must be interpreted in light of a text condition. Specifically, the authors propose a module named Text-Conditioned Resampler (TCR), which combines a pre-trained visual encoder with a large language model (LLM): given a text condition, TCR selects the relevant visual features from a long video and passes them to the LLM, which generates a text response.

### Main Problems and Challenges

1. **Computational Resource Limitations**
   - Existing visual-language models (VLMs) run into computational limits when processing videos. In particular, the memory requirement grows quadratically with video length, which makes these models impractical for long videos.
2. **Understanding of Long Videos**
   - Long videos contain more information, but existing methods can usually process only a small number of frames (for example, 4 to 32), which limits how fully the video content can be understood.
3. **Task Adaptability**
   - Different tasks (such as video question answering and action prediction) place different requirements on the time span and frame density of the input video, and existing methods struggle to adapt flexibly to these differences.

### Solutions

To address the above challenges, the authors propose the TCR module, whose main features include:

- **Lightweight Design**
  - TCR can process video sequences of more than 100 frames while keeping computational complexity low, because it uses cross-attention instead of full self-attention.
- **Text-Conditioning**
  - TCR selects the most relevant visual features according to the given text condition, thereby improving task-specific performance.
- **Modular Integration**
  - TCR can be integrated as a plug-in module into existing visual-language models, such as BLIP2, thereby extending their ability to process long videos.
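
The cross-attention idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it is a single-head cross-attention written in plain Python with no learned projections, and the names `cross_attention` and `attention_score_counts`, the sizes (128 frames, 16 queries, dimension 8), and the one-feature-per-frame simplification are all assumptions made for illustration. It shows a small, fixed set of query vectors (standing in for learnable queries plus text-condition tokens) reading from a long list of visual features, and compares how many attention scores that needs against full self-attention over the visual features.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head cross-attention without learned projections.

    queries:     Q vectors (hypothetical learnable queries plus text tokens)
    keys_values: N vectors (hypothetical per-frame visual features)
    Returns Q output vectors, each a weighted average of the N visual features.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # One score per visual feature: scaled dot product with this query.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, keys_values))
            for d in range(dim)
        ])
    return outputs


def attention_score_counts(num_visual_tokens, num_queries):
    """Scores per head: (full self-attention, cross-attention from the queries)."""
    return num_visual_tokens ** 2, num_queries * num_visual_tokens


rng = random.Random(0)
dim, num_frames, num_queries = 8, 128, 16  # illustrative sizes, not from the paper
visual = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

resampled = cross_attention(queries, visual)
self_cost, cross_cost = attention_score_counts(num_frames, num_queries)
print(len(resampled), len(resampled[0]))  # 16 8
print(self_cost, cross_cost)              # 16384 2048
```

In this sketch the output always has 16 vectors regardless of how many frames go in, and the number of attention scores grows linearly with the frame count rather than quadratically, which is the property the summary attributes to the lightweight design.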
### Experimental Verification

The authors verify the effectiveness of TCR through multiple experiments, in particular on the following tasks:

- **Video Question Answering (NextQA)**
  - On the NextQA dataset, TCR significantly improves the accuracy of the model, especially on longer videos.
- **Long Video Understanding (EgoSchema)**
  - On the EgoSchema dataset, TCR achieves better performance, indicating its advantage on long videos.
- **Future Action Prediction (EGO4D-LTA)**
  - In the long-term action prediction task, TCR significantly improves prediction accuracy by processing longer video sequences.

In summary, the main contribution of this paper is an effective way for visual-language models to remain efficient and accurate when processing long videos, which advances the field of video understanding.