HierVL: Learning Hierarchical Video-Language Embeddings

Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
2023-06-08
Abstract: Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses a key limitation in video understanding: existing video-language embedding methods capture only the association between short video clips and text, and fail to capture long-term context and the intent behind an activity. Specifically:

1. **Short-term and Long-term Associations**: Traditional video-language embedding methods typically learn representations by matching clips a few seconds long with their corresponding text descriptions. This approach focuses on "what is happening" but ignores the broader context and the purpose of the activity (i.e., "why it is happening").
2. **Limitations in Activity Understanding**: Understanding activities in videos lags behind understanding objects in images because activities span many video frames and their interpretation relies on a larger context, namely human intentions. Videos therefore carry a natural hierarchy of information, from short-term "literal actions" (e.g., reaching for the stove) to long-term "goals" (e.g., cooking).
3. **Multimodal Representation Learning**: To capture this hierarchical structure, the authors propose a new hierarchical video-language embedding model (HierVL) that captures both short-term actions and long-term intentions. By introducing a hierarchical contrastive learning objective, HierVL achieves text-visual alignment at both the clip level and the video level.

### Solution

1. **Hierarchical Contrastive Learning**: HierVL uses a two-level contrastive learning objective. The top level (parent level) encourages the aggregated video clips to be close to the overall text summary, while the bottom level (child level) trains individual clips to be similar to their corresponding descriptions. This way, the model captures not only short-term actions but also how these actions contribute to long-term goals.
2. **Data Utilization**: The training data consists of timestamped text descriptions and high-level text summaries from the Ego4D dataset. Ego4D contains a large number of first-person videos of daily activities, each with timestamped descriptions of the actions and an overall summary.
3. **Feature Aggregation**: To capture long-term features efficiently, the authors aggregate short-term features. Specifically, they use either self-attention or average pooling over the short-term features to generate long-term video and text representations.
4. **Joint Training**: The model adopts a joint training strategy, first training on short-term visual and text pairs and then training the long-term features. This strategy ensures that the model optimizes representations at both levels and avoids catastrophic forgetting.

### Experimental Results

1. **Pre-training Evaluation**: Pre-training evaluation on the Ego4D dataset shows that HierVL significantly outperforms the baseline model EgoVLP on long-term tasks (such as SummaryMCQ and ShuffleMCQ). In particular, on the SummaryMCQ task, HierVL-SA improves performance by over 6%.
2. **Downstream Tasks**: The pre-trained representations perform well on multiple downstream tasks, in both zero-shot and fine-tuned settings, on datasets including EPIC-KITCHENS-100, Charades-Ego, and HowTo100M. HierVL achieves state-of-the-art performance on tasks such as Ego4D long-term anticipation (LTA), Charades-Ego action recognition, and EPIC-KITCHENS-100 multi-instance retrieval.

In summary, by introducing a hierarchical video-language embedding method, HierVL addresses the shortcomings of existing methods in capturing long-term context and activity intentions, providing a new perspective for video understanding.
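
The two-level objective and the average-pooling aggregation described under Solution can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a plain symmetric InfoNCE loss at both levels, and the function names (`info_nce`, `hierarchical_loss`), the temperature of 0.07, and the `parent_weight` parameter are hypothetical choices made for illustration. It also covers only the video-to-summary parent term and only the average-pooling variant of aggregation, not self-attention.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row vector to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair.

    The temperature value is an assumption for this sketch.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (N, N) scaled cosine similarities

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

def hierarchical_loss(clip_emb, narration_emb, summary_emb, parent_weight=1.0):
    """Child (clip-level) plus parent (video-level) contrastive loss.

    clip_emb:      (V, C, D) embeddings of C clips from each of V videos
    narration_emb: (V, C, D) embeddings of the matching clip descriptions
    summary_emb:   (V, D)    embeddings of the per-video text summaries
    """
    num_videos, clips_per_video, dim = clip_emb.shape

    # Child level: align every clip with its own description.
    child = info_nce(
        clip_emb.reshape(num_videos * clips_per_video, dim),
        narration_emb.reshape(num_videos * clips_per_video, dim),
    )

    # Parent level: average-pool the clips into one video embedding,
    # then align it with the video's summary text.
    video_emb = clip_emb.mean(axis=1)
    parent = info_nce(video_emb, summary_emb)

    return child + parent_weight * parent, child, parent

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clips = rng.normal(size=(4, 3, 16))            # 4 videos, 3 clips each
    narrations = clips + 0.01 * rng.normal(size=clips.shape)
    summaries = clips.mean(axis=1)
    total, child, parent = hierarchical_loss(clips, narrations, summaries)
    print(f"total={total:.4f} child={child:.4f} parent={parent:.4f}")
```

In this toy setup the text embeddings are near-copies of the visual ones, so both loss terms come out small; pairing each video with another video's descriptions and summary raises them. In practice the embeddings would come from trained video and text encoders, and the pooling step could be swapped for a self-attention aggregator.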