Abstract: Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses a key limitation in video understanding: existing video-language embedding methods capture only the association between short video clips and text, and fail to capture long-term context and the intent behind an activity. Specifically:

1. **Short-term and Long-term Associations**: Traditional video-language embedding methods typically learn representations by matching video clips a few seconds long with their corresponding text descriptions. This approach focuses on "what is happening" but ignores the broader context and the purpose of the activity (i.e., "why it is happening").
2. **Limitations in Activity Understanding**: Understanding activities in videos lags behind understanding objects in images because activities span many video frames and their interpretation relies on a larger context, namely human intentions. Videos therefore carry a natural hierarchy of information, from short-term "literal actions" (e.g., reaching for the stove) to long-term "goals" (e.g., cooking).
3. **Multimodal Representation Learning**: To capture this hierarchy, the authors propose a new hierarchical video-language embedding model (HierVL) that captures short-term actions and long-term intentions simultaneously. With a hierarchical contrastive learning objective, HierVL achieves text-visual alignment at both the clip level and the video level.

### Solution

1. **Hierarchical Contrastive Learning**: HierVL uses a two-level contrastive learning objective. The top level (parent level) encourages the aggregated video clips to be close to the overall text summary, while the bottom level (child level) trains individual clips to be similar to their corresponding descriptions. The model thus captures not only short-term actions but also how these actions contribute to long-term goals.
2. **Data Utilization**: The training data consists of timestamped text descriptions and high-level text summaries from the Ego4D dataset. Ego4D contains a large number of first-person videos of daily activities, each accompanied by timestamped descriptions of human actions and a high-level summary of the overall activity.
3. **Feature Aggregation**: To capture long-term features efficiently, the authors aggregate short-term features. Specifically, they use self-attention or average pooling over the short-term features to produce long-term video and text representations.
4. **Joint Training**: The model is trained jointly at both levels, alternating between short-term clip-text pairs and long-term aggregated features. This ensures that representations at both levels are optimized and avoids catastrophic forgetting of the clip-level alignment.

### Experimental Results

1. **Pre-training Evaluation**: Pre-training evaluation on the Ego4D dataset shows that HierVL significantly outperforms the baseline model EgoVLP on long-term tasks (such as SummaryMCQ and ShuffleMCQ). On the SummaryMCQ task in particular, HierVL-SA (the self-attention variant) improves performance by over 6%.
2. **Downstream Tasks**: The pre-trained representations transfer well to multiple downstream tasks in both zero-shot and fine-tuned settings, on datasets including EPIC-KITCHENS-100, Charades-Ego, and HowTo100M. HierVL achieves state-of-the-art performance on tasks such as Ego4D long-term anticipation (LTA), Charades-Ego action recognition, and EPIC-KITCHENS-100 multi-instance retrieval.

In summary, by introducing a hierarchical video-language embedding, HierVL addresses the shortcomings of existing methods in capturing long-term context and activity intent, providing a new perspective for video understanding.
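
The two-level objective and the average-pooling aggregation described under Solution can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes precomputed embeddings as plain Python lists, uses a standard symmetric InfoNCE loss as a stand-in for the paper's contrastive objective, and all names (`info_nce`, `mean_pool`, `hierarchical_loss`, the dictionary keys, the temperature value) are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # L2-normalize; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def mean_pool(vectors):
    # Average pooling: aggregate short-term features into one long-term feature.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def info_nce(visual, text, temperature=0.07):
    # Symmetric InfoNCE (assumed stand-in loss): the i-th visual feature and
    # the i-th text feature form the positive pair; all others are negatives.
    v = [normalize(x) for x in visual]
    t = [normalize(x) for x in text]
    sim = [[dot(vi, tj) / temperature for tj in t] for vi in v]

    def row_loss(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    sim_t = [list(col) for col in zip(*sim)]  # text-to-visual direction
    return 0.5 * (row_loss(sim) + row_loss(sim_t))


def hierarchical_loss(videos, temperature=0.07):
    # Child level: every clip is aligned with its own timestamped description.
    clip_feats = [c for v in videos for c in v["clips"]]
    narr_feats = [n for v in videos for n in v["narrations"]]
    child = info_nce(clip_feats, narr_feats, temperature)

    # Parent level: pooled clip features are aligned with the video summary.
    video_feats = [mean_pool(v["clips"]) for v in videos]
    summary_feats = [v["summary"] for v in videos]
    parent = info_nce(video_feats, summary_feats, temperature)
    return child, parent


# Toy usage with two videos of two clips each (hypothetical 4-d embeddings).
videos = [
    {"clips": [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
     "narrations": [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
     "summary": [0.5, 0.5, 0.0, 0.0]},
    {"clips": [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
     "narrations": [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
     "summary": [0.0, 0.0, 0.5, 0.5]},
]
child_loss, parent_loss = hierarchical_loss(videos)
print(child_loss, parent_loss)
```

In this sketch the two losses are returned separately so that a training loop could alternate between them or combine them. Replacing `mean_pool` with a learned self-attention aggregator would correspond to the other aggregation option mentioned above.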

HierVL: Learning Hierarchical Video-Language Embeddings

Visual-guided Hierarchical Iterative Fusion for Multi-Modal Video Action

TEVL: Trilinear Encoder for Video-language Representation Learning

HANet: Hierarchical Alignment Networks for Video-Text Retrieval

Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions

Hierarchical Banzhaf Interaction for General Video-Language Representation Learning

Jointly Modeling Embedding and Translation to Bridge Video and Language

VidLA: Video-Language Alignment at Scale

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding

Learning Hierarchical Embedding for Video Instance Segmentation.

HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation

A Hierarchical Deep Video Understanding Method with Shot-Based Instance Search and Large Language Model

Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Enhancing Long Video Understanding via Hierarchical Event-Based Memory

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

LongVLM: Efficient Long Video Understanding via Large Language Models

Video-Language Models as Flexible Social and Physical Reasoners