Abstract: Current video-language models (VLMs) rely extensively on instance-level alignment between video and language modalities, which presents two major limitations: (1) visual reasoning disobeys the natural perception that humans have in the first-person perspective, leading to a lack of reasoning interpretation; and (2) learning is limited in capturing inherent fine-grained relationships between the two modalities. In this paper, we take inspiration from human perception and explore a compositional approach for egocentric video representation. We introduce HENASY (Hierarchical ENtities ASsemblY), which includes a spatiotemporal token grouping mechanism to explicitly assemble dynamically evolving scene entities through time and model their relationships for video representation. By leveraging compositional structure understanding, HENASY possesses strong interpretability via visual grounding with free-form text queries. We further explore a suite of multi-grained contrastive losses to facilitate entity-centric understanding. This comprises three alignment types: video-narration, noun-entity, and verb-entities alignments. Our method demonstrates strong interpretability in both quantitative and qualitative experiments, while maintaining competitive performance on five downstream tasks via zero-shot transfer or as video/text representation, including video/text retrieval, action recognition, multi-choice query, natural language query, and moments query.

What problem does this paper attempt to address?

The paper addresses two major limitations of current video-language models (VLMs) when processing egocentric videos:

1. **Visual reasoning does not conform to natural perception**: Existing VLMs do not reason the way humans perceive a scene in the first-person perspective, which leaves their reasoning hard to interpret.
2. **Insufficient capture of fine-grained relationships**: Learning is limited in capturing the inherent fine-grained relationships between the video and language modalities.

To solve these problems, the authors propose HENASY (Hierarchical ENtities ASsemblY), a framework for interpretable egocentric video-language models. Specifically, HENASY improves on existing methods in the following ways:

- **Dynamic scene entity assembly**: A spatiotemporal token grouping mechanism explicitly assembles scene entities that evolve over time and models the relationships between them to produce the video representation.
- **Multi-grained contrastive losses**: A suite of multi-grained contrastive losses promotes entity-level understanding. It covers three alignment types: video-narration, noun-entity, and verb-entity alignment.
- **Strong interpretability**: The compositional structure supports visual grounding with free-form text queries, which makes the model's reasoning inspectable.

### Specific problem description

Existing VLMs rely on instance-level alignment between the video and language modalities, which causes the following problems:

- **Visual reasoning is not intuitive**: When humans perceive their surroundings in the first-person perspective, they form an overall understanding by composing many small parts. Existing models do not effectively reproduce this compositional mode of perception.
- **Lack of fine-grained relationships**: Videos contain complex dynamic interactions, and simple instance-level alignment cannot capture these fine-grained relationships, especially the entity and action information conveyed by nouns and verbs.

### HENASY's solutions

HENASY addresses the above problems through the following components:

1. **Local Entity Encoder**: A hierarchical Transformer encoder that learns to assemble dynamic scene entities from video clips through the proposed spatiotemporal token grouping mechanism.
2. **Global Encoder**: A pre-trained video representation module that captures the overall features of the input video.
3. **Entity-Aware Decoder**: Models the interactions among scene entities and their relationships with the global features, enriching the entity-level video representation.

In addition, HENASY introduces multi-grained contrastive losses (video-narration, noun-entity, and verb-entity alignment) to optimize the learning of both entity-level and video-level representations.

With these components, HENASY remains competitive on five downstream tasks (video/text retrieval, action recognition, multi-choice query, natural language query, and moments query) while offering strong interpretability: it can provide visual explanations through dynamic saliency maps.

### Summary of mathematical formulas

- **EgoNCE loss**:

  \[ L_{\text{v2t}}^{\text{ego}} = \frac{1}{|\tilde{B}|}\sum_{i\in\tilde{B}}\log\frac{\exp(\hat{v}_i^T\hat{t}_p / \tau)}{\sum_{n\in B}\exp(\hat{v}_i^T\hat{t}_n / \tau)+\exp(\hat{v}_i^T\hat{t}_{n'} / \tau)} \]

  where $\tau$ is the temperature parameter.

- **Noun-entity contrastive loss (NEC)**:

  \[ L_{\text{NEC}} = -\frac{1}{N_n}\sum_{p = 1}^{N_n}\log\frac{\exp(e_p^T n_p / \tau)}{\sum_{j\in D}\exp(e_p^T n_j' / \tau)} \]

- **Verb-entity contrastive loss (VEC)**:
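
To make the noun-entity contrastive loss above more concrete, here is a minimal NumPy sketch of an InfoNCE-style loss with the same shape as the NEC formula. It is an illustration under stated assumptions, not the authors' implementation: the function name `nec_loss`, the `targets` index array, the L2 normalization of embeddings, and the default temperature `tau=0.07` are all hypothetical choices, and $D$ is assumed to be a noun vocabulary whose embeddings are stacked row-wise.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length (assumption: the loss uses cosine similarity)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def nec_loss(entities, nouns, targets, tau=0.07):
    """InfoNCE-style noun-entity loss (hypothetical sketch).

    entities: (N_n, d) array, one embedding e_p per matched scene entity.
    nouns:    (|D|, d) array, embeddings of every noun in the assumed vocabulary D.
    targets:  (N_n,) int array, index in `nouns` of the positive noun n_p for e_p.
    tau:      temperature (0.07 is an assumed default, not taken from the paper).
    """
    e = l2_normalize(np.asarray(entities, dtype=float))
    n = l2_normalize(np.asarray(nouns, dtype=float))
    logits = e @ n.T / tau                                # (N_n, |D|) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    positives = log_prob[np.arange(len(targets)), targets]
    return -positives.mean()                              # -(1/N_n) * sum_p log(...)
```

In this sketch the loss is small when each entity embedding points toward its own noun and grows when entities are paired with the wrong nouns, which is the behavior the alignment objective is meant to encourage.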

HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model
