Abstract: Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DYTO, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DYTO integrates a hierarchical frame selection and a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational efficiency and semantic richness. Extensive experiments across multiple benchmarks demonstrate the effectiveness of DYTO, achieving superior performance compared to both fine-tuned and training-free methods and setting a new state-of-the-art for zero-shot video understanding.

What problem does this paper attempt to address?

The paper addresses zero-shot video understanding: how to retain key spatio-temporal details in a video while improving computational efficiency, without task-specific fine-tuning. Traditional methods usually rely on large amounts of labeled data and computational resources for fine-tuning to capture subtle spatio-temporal features, which is costly. Training-free methods are efficient, but they often fail to preserve context-rich features when dealing with complex video content. For this reason, the authors propose DYTO (DYnamic TOken merging), a dynamic token merging framework for zero-shot video understanding. DYTO adaptively optimizes token efficiency while retaining key scene details. It combines hierarchical frame selection with bipartite token merging to dynamically cluster key frames and selectively compress token sequences, balancing computational efficiency against semantic richness.

Specifically, DYTO aims to solve the following problems:

1. **Retain key spatio-temporal information**: Ensure that important spatio-temporal details are not lost while the number of tokens is reduced.
2. **Improve computational efficiency**: Dynamically adjust the token merging ratio so that redundant tokens are removed and the computational cost drops.
3. **Adapt to videos of different lengths**: Capture key events and maintain high understanding accuracy regardless of video length.

### Core contributions of the paper

1. **Propose a new hierarchical bipartite merging strategy**: Dynamically selecting key frames and performing adaptive token merging optimizes spatio-temporal fidelity and retains finer-grained features.
2. **Demonstrate superior performance on multiple benchmarks**: According to the abstract, DYTO outperforms both fine-tuned and training-free methods across multiple benchmarks while reducing the number of visual tokens to be processed.

### Method overview

The main steps of DYTO are:

- **Coarse-grained hierarchical clustering**: Hierarchically cluster the video frames, dividing the video into multiple clusters, and select key frames from them.
- **Fine-grained dynamic bipartite merging**: Dynamically apply bipartite merging to the frames within each cluster, minimizing the number of visual tokens while retaining more context information.

Together, these steps allow DYTO to perform better in zero-shot video understanding, especially on tasks that require detailed spatio-temporal and contextual understanding.
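The two steps of the method overview, hierarchical frame clustering followed by the bipartite token merging named in the abstract, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`cluster_frames`, `pick_keyframe`, `bipartite_merge`, `merge_to`) are hypothetical, and so are the adjacency-restricted clustering rule and the fixed per-frame token target. The matching step is written in the style of the bipartite soft matching used by Token Merging (ToMe), simplified to an unweighted mean within each pass.

```python
import numpy as np


def _unit(x: np.ndarray) -> np.ndarray:
    """L2-normalise rows so that dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)


def cluster_frames(frame_feats: np.ndarray, n_clusters: int) -> list:
    """Agglomerative clustering over temporally adjacent segments.

    Assumption: only neighbouring segments may merge, so every cluster
    is a contiguous run of frame indices.
    """
    n_clusters = max(1, n_clusters)
    segments = [[i] for i in range(len(frame_feats))]
    while len(segments) > n_clusters:
        means = _unit(np.stack([frame_feats[s].mean(axis=0) for s in segments]))
        sims = np.sum(means[:-1] * means[1:], axis=1)  # neighbour similarity
        i = int(np.argmax(sims))                       # most similar pair
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
    return segments


def pick_keyframe(frame_feats: np.ndarray, segment: list) -> int:
    """Return the frame index closest (cosine) to the segment mean."""
    feats = _unit(frame_feats[segment])
    centre = _unit(feats.mean(axis=0, keepdims=True))
    return segment[int(np.argmax(feats @ centre.T))]


def bipartite_merge(tokens: np.ndarray, r: int) -> np.ndarray:
    """Merge up to r tokens of one frame; returns len(tokens) - r tokens."""
    a, b = tokens[0::2], tokens[1::2]          # alternating split into A and B
    r = min(r, len(a))
    if r <= 0 or len(b) == 0:
        return tokens.copy()
    sims = _unit(a) @ _unit(b).T               # (|A|, |B|) cosine similarities
    best_b = sims.argmax(axis=1)               # best partner in B for each A
    order = np.argsort(-sims.max(axis=1))      # most redundant A tokens first
    merged_a, kept_a = order[:r], np.sort(order[r:])
    sums = b.astype(float).copy()
    counts = np.ones(len(b))
    np.add.at(sums, best_b[merged_a], a[merged_a])   # accumulate into partners
    np.add.at(counts, best_b[merged_a], 1.0)
    return np.concatenate([a[kept_a], sums / counts[:, None]], axis=0)


def merge_to(tokens: np.ndarray, target: int) -> np.ndarray:
    """Repeat bipartite merging until at most `target` tokens remain."""
    while len(tokens) > target:
        reduced = bipartite_merge(tokens, len(tokens) - target)
        if len(reduced) == len(tokens):        # nothing left to merge
            break
        tokens = reduced
    return tokens
```

As a usage example under the same assumptions: given per-frame token features of shape `(frames, tokens, dim)`, one could average tokens to get `frame_feats`, call `cluster_frames(frame_feats, k)`, choose one key frame per cluster with `pick_keyframe`, and compress each key frame with `merge_to`. Each pass can remove at most about half of the tokens, because only tokens in set A are merged away, which is why `merge_to` loops. How the token target should vary per cluster is the "dynamic" part of the paper and is not modelled here.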

Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
