PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data

Roei Herzig,Ofir Abramovich,Elad Ben-Avraham,Assaf Arbelle,Leonid Karlinsky,Ariel Shamir,Trevor Darrell,Amir Globerson

2023-12-06

Abstract: Action recognition models have achieved impressive results by incorporating scene-level annotations, such as objects, their relations, 3D structure, and more. However, obtaining annotations of scene structure for videos requires a significant amount of effort to gather and annotate, making these methods expensive to train. In contrast, synthetic datasets generated by graphics engines provide powerful alternatives for generating scene-level annotations across multiple tasks. In this work, we propose an approach to leverage synthetic scene data for improving video understanding. We present a multi-task prompt learning approach for video transformers, where a shared video transformer backbone is enhanced by a small set of specialized parameters for each task. Specifically, we add a set of "task prompts", each corresponding to a different task, and let each prompt predict task-related annotations. This design allows the model to capture information shared among synthetic scene tasks as well as information shared between synthetic scene tasks and a real video downstream task throughout the entire network. We refer to this approach as "Promptonomy", since the prompts model task-related structure. We propose the PromptonomyViT model (PViT), a video transformer that incorporates various types of scene-level information from synthetic data using the "Promptonomy" approach. PViT shows strong performance improvements on multiple video understanding tasks and datasets. Project page: https://ofir1080.github.io/PromptonomyViT

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the cost of collecting large-scale annotated real video datasets, which limits how well video understanding models can be trained for action recognition and related tasks. It proposes improving these models with synthetic scene data instead. A multi-task prompt learning approach captures the information shared among the synthetic scene tasks, and between those tasks and the real-video downstream task, without applying any dedicated technique to bridge the domain gap between synthetic and real data. The main contributions of the paper are as follows:

1. A new method that uses multiple types of synthetically generated labels to improve video understanding models.

2. The concept of "multi-task prompts": each prompt captures task-relevant information through its own task supervision and interacts with the prompts of the other tasks and with the downstream video task.

3. Improved performance on 5 video understanding benchmarks, covering compositional action recognition, few-shot action recognition, and spatiotemporal action detection, which demonstrates the effectiveness of the proposed method.
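The task-prompt idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the task names, dimensions, single attention layer, and linear prediction heads are all assumptions chosen for illustration. It only shows the general pattern of appending one learnable prompt token per task to the patch tokens, passing everything through shared weights, and reading each task's prediction from its own prompt.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Single-head self-attention over all tokens, so patch tokens and
    # task prompts can exchange information with each other.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return tokens + softmax(scores) @ v  # residual connection

dim = 16
num_patches = 8

# Hypothetical task names and output sizes (assumptions, not from the paper).
task_output_dims = {"depth": 4, "normals": 3, "segmentation": 5}

# Shared backbone weights, used by every task.
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))

# Small set of task-specific parameters: one prompt token and one head per task.
prompts = {t: rng.normal(size=(dim,)) for t in task_output_dims}
heads = {t: rng.normal(scale=0.1, size=(dim, d)) for t, d in task_output_dims.items()}

def forward(patch_tokens):
    task_names = list(prompts)
    prompt_tokens = np.stack([prompts[t] for t in task_names])
    # Append the task prompts to the patch tokens and run the shared layer.
    tokens = np.concatenate([patch_tokens, prompt_tokens], axis=0)
    tokens = self_attention(tokens, w_q, w_k, w_v)
    # Each task's prediction is read from its own updated prompt token.
    out_prompts = tokens[len(patch_tokens):]
    return {t: out_prompts[i] @ heads[t] for i, t in enumerate(task_names)}

patch_tokens = rng.normal(size=(num_patches, dim))
predictions = forward(patch_tokens)
```

In this sketch, each entry of `predictions` would be compared against that task's synthetic annotation during training, so the shared weights receive gradients from every task while the prompts and heads stay task-specific.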

TSP-Transformer: Task-Specific Prompts Boosted Transformer for Holistic Scene Understanding

VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval

Learning Expressive Prompting With Residuals for Vision Transformers

IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

Instruction-ViT: Multi-modal prompts for instruction learning in vision transformer

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization

Enhancing Video Transformers for Action Understanding with VLM-aided Training

Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT

Super-Prompting: Utilizing Model-Independent Contextual Data to Reduce Data Annotation Required in Visual Commonsense Tasks

Visual Prompt Multi-Modal Tracking

Multitask Vision-Language Prompt Tuning

VIMA: Robot Manipulation with Multimodal Prompts

Align and Prompt: Video-and-Language Pre-training with Entity Prompts

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Improving Visual Prompt Tuning for Self-supervised Vision Transformers

Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection

Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving

PECTP: Parameter-Efficient Cross-Task Prompts for Incremental Vision Transformer

Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained Experts