Compositional Video Understanding with Spatiotemporal Structure-based Transformers

Hoyeoung Yun, Minseo Kim, Eun-Sol Kim, Jinwoo Ahn
DOI: https://doi.org/10.1109/CVPR52733.2024.01774
2024-06-16
Computer Vision and Pattern Recognition
Abstract: In this paper, we propose a novel method for understanding complex semantic structures in long video inputs. Conventional video understanding methods have focused on short-term clips and are trained to obtain visual representations of those clips using convolutional neural networks or transformer architectures. However, most real-world videos are long, ranging from minutes to hours, so dividing them into small clips and learning a representation of each clip inherently limits understanding of their overall semantic structure. We propose a new algorithm that learns the multi-granular semantic structures of videos by defining spatiotemporal high-order relationships among object-based representations as semantic units. The proposed method includes a new transformer architecture capable of learning spatiotemporal graphs and a compositional learning method that learns disentangled features for each semantic unit. Using the proposed method, we address a challenging video task: compositional generalization in understanding unseen videos. In experiments, we demonstrate new state-of-the-art performance on two challenging video datasets.
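To make the idea of a transformer operating on a spatiotemporal graph of object-based representations more concrete, the following is a minimal illustrative sketch, not the authors' architecture. It assumes a hypothetical edge rule (objects in the same frame are connected spatially; the same object slot in adjacent frames is connected temporally) and a single unparameterized dot-product attention step restricted to graph neighbours. The function names `build_spatiotemporal_edges` and `graph_masked_attention`, and the toy feature values, are invented for illustration.

```python
import math

def build_spatiotemporal_edges(num_frames, num_objects):
    """Return directed edges as a set of (i, j) node-index pairs.

    Node index = frame * num_objects + object slot.
    Assumed (hypothetical) rule: spatial edges link all objects within a
    frame, including self-loops; temporal edges link the same object slot
    in adjacent frames, in both directions.
    """
    edges = set()
    for t in range(num_frames):
        for a in range(num_objects):
            i = t * num_objects + a
            for b in range(num_objects):      # spatial edges within frame t
                edges.add((i, t * num_objects + b))
            if t + 1 < num_frames:            # temporal edges to frame t + 1
                j = (t + 1) * num_objects + a
                edges.add((i, j))
                edges.add((j, i))
    return edges

def graph_masked_attention(features, edges):
    """One dot-product attention step; node i attends only to its neighbours."""
    dim = len(features[0])
    updated = []
    for i, query in enumerate(features):
        nbrs = [j for j in range(len(features)) if (i, j) in edges]
        scores = [
            sum(q * k for q, k in zip(query, features[j])) / math.sqrt(dim)
            for j in nbrs
        ]
        top = max(scores)                     # subtract max for a stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        updated.append([
            sum(w / total * features[j][d] for w, j in zip(weights, nbrs))
            for d in range(dim)
        ])
    return updated

# Toy input: 2 frames x 2 object slots, each with a 2-dimensional feature.
edges = build_spatiotemporal_edges(num_frames=2, num_objects=2)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
updated = graph_masked_attention(features, edges)
```

In this sketch the graph acts only as an attention mask; a learned model would add query, key, and value projections, multiple heads, and some mechanism for the high-order relationships and disentangled semantic units the abstract describes, none of which are specified here.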
Computer Science