HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model

Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, Ngan Le
2024-11-02
Abstract: Current video-language models (VLMs) rely extensively on instance-level alignment between video and language modalities, which presents two major limitations: (1) visual reasoning disobeys the natural perception that humans do in first-person perspective, leading to a lack of reasoning interpretation; and (2) learning is limited in capturing inherent fine-grained relationships between two modalities. In this paper, we take an inspiration from human perception and explore a compositional approach for egocentric video representation. We introduce HENASY (Hierarchical ENtities ASsemblY), which includes a spatiotemporal token grouping mechanism to explicitly assemble dynamically evolving scene entities through time and model their relationship for video representation. By leveraging compositional structure understanding, HENASY possesses strong interpretability via visual grounding with free-form text queries. We further explore a suite of multi-grained contrastive losses to facilitate entity-centric understandings. This comprises three alignment types: video-narration, noun-entity, verb-entities alignments. Our method demonstrates strong interpretability in both quantitative and qualitative experiments; while maintaining competitive performances on five downstream tasks via zero-shot transfer or as video/text representation, including video/text retrieval, action recognition, multi-choice query, natural language query, and moments query.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper targets two limitations of current video-language models (VLMs) on egocentric video:

1. **Visual reasoning does not conform to natural perception**: Existing VLMs do not reason the way humans perceive a scene from the first-person perspective, so their predictions are hard to interpret.
2. **Insufficient capture of fine-grained relationships**: Learning is limited in its ability to capture the inherent fine-grained relationships between the video and language modalities.

To address these problems, the authors propose HENASY (Hierarchical ENtities ASsemblY), a framework for interpretable egocentric video-language modeling. Specifically, HENASY improves on existing methods in the following ways:

- **Dynamic scene entity assembly**: A spatiotemporal token grouping mechanism explicitly assembles scene entities as they evolve over time and models the relationships between them to produce the video representation.
- **Multi-granularity contrastive loss**: A suite of multi-granularity contrastive losses promotes entity-level understanding. It covers three alignment types: video-narration, noun-entity, and verb-entity alignment.
- **Strong interpretability**: Because it models compositional structure explicitly, HENASY supports visual grounding of free-form text queries, which makes its reasoning interpretable.

### Specific problem description

Existing VLMs rely on instance-level alignment between the video and language modalities, which causes the following problems:

- **Visual reasoning is not intuitive**: Humans perceiving their surroundings from the first-person perspective build an overall understanding by composing many smaller parts. Existing models do not effectively simulate this compositional mode of perception.
- **Lack of fine-grained relationships**: Videos contain complex dynamic interactions, and simple instance-level alignment cannot capture these fine-grained relationships, especially the entity and action information conveyed by nouns and verbs.

### HENASY's solutions

HENASY addresses these problems with three components:

1. **Local Entity Encoder**: A hierarchical Transformer encoder that learns to assemble dynamic scene entities from video clips through the proposed spatiotemporal token grouping mechanism.
2. **Global Encoder**: A pre-trained video representation module that captures the overall features of the input video.
3. **Entity-Aware Decoder**: A decoder that models the interactions among scene entities and their relationships with the global features, enriching the entity-level video representation.

In addition, HENASY introduces multi-granularity contrastive losses (video-narration, noun-entity, and verb-entity alignment) to optimize the learning of both entity-level and video-level representations. With these components, HENASY remains competitive on five downstream tasks while offering strong interpretability: it can provide visual explanations through dynamic saliency maps.

### Summary of mathematical formulas

- **EgoNCE loss**:
  \[ L_{\text{v2t}}^{\text{ego}} = \frac{1}{|\tilde{B}|}\sum_{i\in\tilde{B}}\log\frac{\exp(\hat{v}_i^T\hat{t}_p/\tau)}{\sum_{n\in B}\left(\exp(\hat{v}_i^T\hat{t}_n/\tau)+\exp(\hat{v}_i^T\hat{t}_{n'}/\tau)\right)} \]
  where $\tau$ is the temperature parameter.
- **Noun-entity contrastive loss (NEC)**:
  \[ L_{\text{NEC}} = -\frac{1}{N_n}\sum_{p=1}^{N_n}\log\frac{\exp(e_p^T n_p/\tau)}{\sum_{j\in D}\exp(e_p^T n_j'/\tau)} \]
- **Verb-entity contrastive loss (VEC)**: \[