Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zenghui Ding, Xianjun Yang, Yining Sun
2024-11-22
Abstract: Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DYTO, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DYTO integrates hierarchical frame selection and a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational efficiency and semantic richness. Extensive experiments across multiple benchmarks demonstrate the effectiveness of DYTO, achieving superior performance compared to both fine-tuned and training-free methods and setting a new state-of-the-art for zero-shot video understanding.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to retain key spatio-temporal details in a video while improving computational efficiency in zero-shot video understanding, without task-specific fine-tuning. Traditional methods rely on large amounts of labeled data and computational resources for fine-tuning to capture subtle spatio-temporal features, which is costly. Training-free methods are efficient, but they often fail to preserve context-rich features when dealing with complex video content. The authors therefore propose DYTO (DYnamic TOken merging), a dynamic token merging framework for zero-shot video understanding. DYTO combines hierarchical frame selection with a bipartite token merging strategy: it dynamically clusters key frames and selectively compresses token sequences, balancing computational efficiency against semantic richness.

Specifically, DYTO aims to:

1. **Retain key spatio-temporal information**: Ensure that important spatio-temporal details are not lost while the number of tokens is reduced.
2. **Improve computational efficiency**: Dynamically adjust the token merging ratio so that redundant tokens are removed and computational cost falls.
3. **Adapt to videos of different lengths**: Capture key events and maintain high understanding accuracy regardless of video length.

### Core contributions of the paper

1. **Propose a new hierarchical bipartite merging strategy**: By dynamically selecting key frames and performing adaptive token merging, the method optimizes spatio-temporal fidelity and retains finer-grained features.
2. **Demonstrate superior performance on multiple benchmarks**: DYTO outperforms existing methods in understanding ability across a variety of benchmarks and also has an advantage in computational efficiency.

### Method overview

The main steps of DYTO are:

- **Coarse-grained hierarchical clustering**: Hierarchically cluster the video frames, dividing the video into multiple clusters, and select key frames from them.
- **Fine-grained dynamic bipartite merging**: Dynamically apply bipartite merging to the frames within each cluster, minimizing the number of visual tokens while retaining more contextual information.

Together, these steps let DYTO achieve better performance in zero-shot video understanding tasks, especially those requiring detailed spatio-temporal and contextual understanding.
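The two steps in the method overview can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the function names, cosine similarity as the metric, merging only temporally adjacent segments during clustering, picking the frame nearest the cluster centroid as the key frame, an alternating A/B token split with plain averaging for the bipartite step, and a fixed per-frame token target standing in for the dynamic merging ratio.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_frames(frame_feats, num_clusters):
    """Coarse step: agglomerative clustering of per-frame feature vectors.

    Assumption: only neighbouring segments may merge, so clusters stay
    temporally contiguous.
    """
    clusters = [[i] for i in range(len(frame_feats))]
    while len(clusters) > max(1, num_clusters):
        cents = [mean([frame_feats[i] for i in c]) for c in clusters]
        # Merge the most similar pair of adjacent segments.
        k = max(range(len(clusters) - 1),
                key=lambda j: cosine(cents[j], cents[j + 1]))
        clusters[k:k + 2] = [clusters[k] + clusters[k + 1]]
    return clusters


def select_key_frames(frame_feats, clusters):
    """Pick, per cluster, the frame most similar to the cluster centroid."""
    keys = []
    for c in clusters:
        cent = mean([frame_feats[i] for i in c])
        keys.append(max(c, key=lambda i: cosine(frame_feats[i], cent)))
    return keys


def bipartite_merge(tokens, r):
    """Fine step: remove up to r tokens with one round of bipartite matching.

    Tokens at even positions form set A, odd positions form set B. Each A
    token is matched to its most similar B token, and the r strongest
    matches are merged by unweighted averaging (a simplification).
    """
    if r <= 0 or len(tokens) < 2:
        return list(tokens)
    a_idx = range(0, len(tokens), 2)
    b_idx = range(1, len(tokens), 2)
    matches = []
    for i in a_idx:
        j = max(b_idx, key=lambda k: cosine(tokens[i], tokens[k]))
        matches.append((cosine(tokens[i], tokens[j]), i, j))
    matches.sort(reverse=True)
    absorbed = {i: j for _, i, j in matches[:r]}  # A index -> B index
    groups = {j: [tokens[j]] for j in b_idx}
    for i, j in absorbed.items():
        groups[j].append(tokens[i])
    # Keep original order; absorbed A tokens disappear into their B partner.
    return [mean(groups[t]) if t in groups else tokens[t]
            for t in range(len(tokens)) if t not in absorbed]


def merge_to_budget(tokens, target):
    """Repeat bipartite merging until at most `target` tokens remain.

    One round can remove at most half of the tokens, hence the loop.
    """
    target = max(1, target)
    while len(tokens) > target:
        r = min(len(tokens) - target, len(tokens) // 2)
        tokens = bipartite_merge(tokens, r)
    return tokens
```

In this sketch a caller would run `cluster_frames` and `select_key_frames` on per-frame features, then apply `merge_to_budget` to each key frame's visual tokens. One hypothetical budgeting rule is to split a fixed total token budget evenly across the selected key frames, so frames are compressed more when more key frames are kept.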