Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding

Mohamed Afham, Satya Narayan Shukla, Omid Poursaeed, Pengchuan Zhang, Ashish Shah, Sernam Lim
2023-09-21
Abstract: While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length and aggregating the outputs. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. Viewing videos as semantically consistent segments, we formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of how to sample frames for long-form video understanding. Most existing video understanding models are designed for short clips (usually 5 to 10 seconds) and run into computational and memory bottlenecks on real-world videos that last several minutes. The common workaround is to split the long video into uniformly sampled clips of fixed length, run a short-form model on each clip, and aggregate the outputs. This ignores the structure of long videos: they consist of semantically consistent segments of variable length, so fixed-length clips are often redundant or uninformative. To solve this, the authors propose a task-agnostic, adaptive, and unsupervised sampling mechanism based on Kernel Temporal Segmentation (KTS). The method decomposes the video into semantically consistent segments and then samples frames uniformly within each segment to construct the input tokens. Experiments on long-form video classification and temporal action localization show consistent gains over existing approaches and state-of-the-art performance on long-form video modeling.
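The segment-then-sample idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. It makes several assumptions: per-frame features are already available as a NumPy array, the kernel is linear, and the number of segments is fixed in advance rather than chosen automatically. The function names `kts_change_points` and `sample_frames` are hypothetical. The segmentation step uses dynamic programming to find the contiguous split that minimizes the total within-segment kernel variance.

```python
import numpy as np


def kts_change_points(features, num_segments):
    """Return segment boundaries [0, t1, ..., n] for n frames.

    Assumption: linear kernel and a fixed number of segments.
    """
    X = np.asarray(features, dtype=float)
    n = X.shape[0]
    K = X @ X.T  # linear-kernel Gram matrix between frames

    # Prefix sums so that each segment cost is an O(1) lookup.
    diag_cum = np.concatenate(([0.0], np.cumsum(np.diag(K))))
    block_cum = np.zeros((n + 1, n + 1))
    block_cum[1:, 1:] = K.cumsum(axis=0).cumsum(axis=1)

    def cost(i, j):
        # Within-segment kernel variance of frames i .. j-1.
        block = (block_cum[j, j] - block_cum[i, j]
                 - block_cum[j, i] + block_cum[i, i])
        return diag_cum[j] - diag_cum[i] - block / (j - i)

    # best[k, j]: minimal cost of splitting frames 0 .. j-1 into k segments.
    best = np.full((num_segments + 1, n + 1), np.inf)
    back = np.zeros((num_segments + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, num_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1, i] + cost(i, j)
                if candidate < best[k, j]:
                    best[k, j] = candidate
                    back[k, j] = i

    # Walk the back-pointers from the last frame to recover the boundaries.
    bounds = [n]
    j = n
    for k in range(num_segments, 0, -1):
        j = int(back[k, j])
        bounds.append(j)
    return bounds[::-1]


def sample_frames(bounds, frames_per_segment):
    """Pick evenly spaced frame indices inside each segment."""
    indices = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        picks = np.linspace(start, end - 1, frames_per_segment)
        indices.append(picks.round().astype(int).tolist())
    return indices


# Toy video: six frames of one "scene" followed by four frames of another.
toy_features = np.array([[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 4)
toy_bounds = kts_change_points(toy_features, num_segments=2)
toy_indices = sample_frames(toy_bounds, frames_per_segment=2)
```

In this sketch every segment contributes the same number of frames regardless of its duration, so a long static scene and a short one yield equally sized token groups. A segment shorter than `frames_per_segment` would produce repeated indices, which a fuller version would need to handle.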