ElasticTok: Adaptive Tokenization for Image and Video

Wilson Yan,Matei Zaharia,Volodymyr Mnih,Pieter Abbeel,Aleksandra Faust,Hao Liu
2024-10-11
Abstract:Efficient video tokenization remains a key bottleneck in learning general purpose vision models that are capable of processing long video sequences. Prevailing approaches are restricted to encoding videos to a fixed number of tokens, where too few tokens will result in overly lossy encodings, and too many tokens will result in prohibitively long sequence lengths. In this work, we introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. To enable this in a computationally scalable way, we propose a masking technique that drops a random number of tokens at the end of each frames's token encoding. During inference, ElasticTok can dynamically allocate tokens when needed -- more complex data can leverage more tokens, while simpler data only needs a few tokens. Our empirical evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage, paving the way for future development of more powerful multimodal models, world models, and agents.
Machine Learning
What problem does this paper attempt to address?
The paper attempts to address the inefficiency of traditional video encoding methods when dealing with long video sequences. Specifically, existing methods typically encode videos into a fixed number of tokens, which leads to two main issues: 1. **Information Loss**: If the number of tokens is too small, the encoding becomes too coarse, resulting in the loss of a significant amount of detail. 2. **Waste of Computational Resources**: If the number of tokens is too large, although more details can be preserved, it significantly increases computational costs and memory requirements, especially when processing long videos. To solve these problems, the paper proposes a new method called ElasticTok, which can adaptively encode a different number of tokens based on the complexity of the video content. This method not only represents videos more efficiently but also reduces the consumption of computational resources while maintaining high quality. ElasticTok achieves this by introducing a random masking technique that dynamically adjusts the number of tokens per frame during training and allocates tokens dynamically as needed during inference.