Abstract: Long video understanding presents challenges due to the inherent high computational complexity and redundant temporal information. An effective representation for long videos must process such redundancy efficiently while preserving essential contents for downstream tasks. This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos. To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities: scenes, objects, and actions, allowing models to operate on a handful of entities rather than a large number of frames or pixels. To further address redundancy, we propose an attention learning module that balances token relevance with diversity formulated as a subset selection optimization problem. Our representation is versatile, enabling applications across various long video understanding tasks. Extensive experiments show that SEAL significantly outperforms state-of-the-art methods in video question answering and temporal grounding tasks and benchmarks including LVBench, MovieChat-1K, and Ego4D.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to address several key challenges in long video understanding:

1. **High computational complexity**: Long videos contain a large number of frames and pixels, which makes processing and analyzing them computationally expensive. Current hardware may not support the computation and memory required for training or inference on long videos.
2. **Temporal redundancy**: Scenes and objects in long videos usually change slowly, which produces a large amount of temporally redundant information. This redundancy not only makes processing harder but may also cause the model to overlook important content during learning.
3. **Cross-task generalization**: A strong representation needs to apply to a range of downstream tasks, from detail finding (such as answering specific factual questions) to high-level reasoning (such as understanding complex causal relationships). Existing models often perform poorly in this regard.

To solve these problems, the paper proposes **SEAL (SEmantic Attention Learning)**, a new unified representation for long video understanding. SEAL addresses the above challenges in two main steps:

- **Semantic decomposition**: Long videos are decomposed into three types of semantic entities: scenes, objects, and actions. These entities are treated as "tokens", so the model operates on a small number of entities instead of a large number of frames or pixels, which reduces computational complexity.
- **Attention learning**: An attention learning module balances the relevance and diversity of tokens by solving a subset selection optimization problem, which further reduces temporal redundancy and improves cross-task generalization.

With these two steps, SEAL can significantly improve processing efficiency and performance while preserving the important information in long videos. Experimental results show that SEAL significantly outperforms existing methods on multiple long video understanding tasks and benchmarks, including video question answering (MovieChat-1K, LVBench) and temporal grounding (Ego4D).

### Formula summary

In the SEAL framework, the objective of the attention learning module can be expressed as:

\[ T^*_{s}=\arg\max_{T_s\subset T_G}F_s(T_s|T_G,q) \]

where:

- \(T_s\) is a subset selected from the set of all tokens \(T_G\).
- \(q\) is the query of the downstream task.
- \(F_s(T_s|T_G,q)\) is the objective function, which consists of two parts:
  - Query relevance \(R(\cdot)\), which measures the relevance between visual tokens and the query.
  - Token diversity \(S(\cdot)\), which ensures diversity among the selected tokens.

The specific formula is:

\[ F_s(T_s|T_G,q)=\alpha\sum_{t_s\in T_s}R(t_s,q)+(1 - \alpha)\sum_{t_i,t_j\in T_s}\frac{1}{S(t_i,t_j)} \]

Here, \(\alpha\) is a hyperparameter that balances query relevance against token diversity. Because \(S\) enters through its reciprocal, the second term is larger when \(S(t_i,t_j)\) is small, which suggests that \(S\) behaves as a pairwise similarity score. In this way, SEAL reduces redundant information when processing long videos and improves the generalization ability and performance of the model.
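To make the objective \(F_s\) concrete, here is a minimal Python sketch of query-aware subset selection. It is an illustration under stated assumptions, not the paper's implementation. The summary does not say how \(R\), \(S\), or the maximization are realized, so the sketch assumes that \(R\) is cosine similarity to a query embedding, that \(S\) is a rescaled cosine similarity between tokens, that the pair sum runs over unordered distinct pairs, and that the subset is built greedily under a hypothetical budget `k`. All function names are illustrative.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relevance(token, query):
    # Assumed stand-in for R(t, q): cosine similarity to the query embedding.
    return cosine(token, query)


def similarity(t_i, t_j, eps=1e-6):
    # Assumed stand-in for S(t_i, t_j): cosine similarity rescaled to about [0, 1].
    # eps keeps the value strictly positive so that 1 / S stays finite.
    return (1.0 + cosine(t_i, t_j)) / 2.0 + eps


def objective(subset, tokens, query, alpha):
    """F_s for a list of token indices, summing 1 / S over unordered pairs."""
    rel = sum(relevance(tokens[i], query) for i in subset)
    div = sum(
        1.0 / similarity(tokens[i], tokens[j])
        for i, j in combinations(subset, 2)
    )
    return alpha * rel + (1.0 - alpha) * div


def select_tokens(tokens, query, k, alpha):
    """Greedy heuristic (an assumption): add the index that maximizes F_s, up to k."""
    selected = []
    remaining = list(range(len(tokens)))
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda i: objective(selected + [i], tokens, query, alpha),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-D "tokens": index 1 nearly duplicates index 0; index 3 is orthogonal to the query.
tokens = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8], [0.0, 1.0]]
query = [1.0, 0.0]
print(select_tokens(tokens, query, k=2, alpha=0.9))  # [0, 1]
print(select_tokens(tokens, query, k=2, alpha=0.3))  # [0, 3]
```

With \(\alpha = 0.9\) the relevance term dominates and the near-duplicate token is kept. With \(\alpha = 0.3\) the diversity term dominates and the dissimilar token replaces it. The budget `k` is needed in this sketch because every added pair contributes a positive diversity term, so the objective tends to grow with subset size.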

SEAL: Semantic Attention Learning for Long Video Representation

Encoding and Controlling Global Semantics for Long-form Video Question Answering

Streaming Long Video Understanding with Large Language Models

Towards Neuro-Symbolic Video Understanding

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis

SAM: Modeling Scene, Object and Action with Semantics Attention Modules for Video Recognition

Enhancing Long Video Understanding via Hierarchical Event-Based Memory

Video Action Recognition with Attentive Semantic Units

Language Repository for Long Video Understanding

Learning Spatial-Semantic Features for Robust Video Object Segmentation

LongVLM: Efficient Long Video Understanding via Large Language Models

LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

SEAL: A Large-scale Video Dataset of Multi-grained Spatio-temporally Action Localization

Towards Long-Form Video Understanding

Semantic Modulation Based Residual Network for Temporal Language Queries Grounding in Video

Attention-based LSTM with Semantic Consistency for Videos Captioning

Video Captioning With Attention-Based LSTM and Semantic Consistency

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

VideoAgent: Long-form Video Understanding with Large Language Model as Agent