Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy

2024-07-09

Abstract: The performance of Large Vision Language Models (LVLMs) is dependent on the size and quality of their training datasets. Existing video instruction tuning datasets lack diversity as they are derived by prompting large language models with video captions to generate question-answer pairs, and are therefore mostly descriptive. Meanwhile, many labeled video datasets with diverse labels and supervision exist - however, we find that their integration into LVLMs is non-trivial. Herein, we present Video Self-Training with augmented Reasoning (Video-STaR), the first video self-training approach. Video-STaR allows the utilization of any labeled video dataset for video instruction tuning. In Video-STaR, an LVLM cycles between instruction generation and finetuning, which we show (I) improves general video understanding and (II) adapts LVLMs to novel downstream tasks with existing supervision. During generation, an LVLM is prompted to propose an answer. The answers are then filtered only to those that contain the original video labels, and the LVLM is then re-trained on the generated dataset. By only training on generated answers that contain the correct video labels, Video-STaR utilizes these existing video labels as weak supervision for video instruction tuning. Our results demonstrate that Video-STaR-enhanced LVLMs exhibit improved performance in (I) general video QA, where TempCompass performance improved by 10%, and (II) on downstream tasks, where Video-STaR improved Kinetics700-QA accuracy by 20% and action quality assessment on FineDiving by 15%.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses how to use existing labeled video datasets for video instruction tuning of Large Vision Language Models (LVLMs). Existing video instruction tuning datasets lack diversity: they are built by prompting large language models with video captions to generate question-answer pairs, so they are mostly descriptive. Many labeled video datasets with diverse labels and supervision exist, but integrating them into LVLMs is non-trivial.

The paper proposes Video Self-Training with augmented Reasoning (Video-STaR), which it presents as the first video self-training approach and which allows any labeled video dataset to be used for video instruction tuning. Video-STaR cycles through three stages. In the generation stage, the LVLM is prompted to propose an answer. In the verification stage, only answers that contain the original video labels are retained. In the fine-tuning stage, the LVLM is retrained on the retained answers. In this way, existing video labels serve as weak supervision for video instruction tuning.

According to the reported results, Video-STaR-enhanced LVLMs improve both in general video question answering and on downstream tasks such as action recognition and action quality assessment. Specifically, TempCompass performance improved by 10%, Kinetics700-QA accuracy improved by 20%, and action quality assessment on FineDiving improved by 15%. Overall, the paper targets two limitations of LVLMs, general video understanding and adaptation to new tasks, by making existing labeled video datasets usable for instruction tuning.
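The generate, verify, and fine-tune cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `video_star_cycle`, `contains_label`, the `generate` and `finetune` callables, and the toy stand-ins are all hypothetical names introduced here, and label verification is approximated by a case-insensitive substring match, which the summary does not specify.

```python
from typing import Any, Callable, Dict, List, Tuple

Sample = Dict[str, str]  # assumed keys: "video", "question", "label"


def contains_label(answer: str, label: str) -> bool:
    """Verification step (assumed form): keep an answer only if it
    mentions the ground-truth label, ignoring case."""
    return label.lower() in answer.lower()


def video_star_cycle(
    generate: Callable[[Any, str, str], str],
    finetune: Callable[[Any, List[Sample]], Any],
    model: Any,
    dataset: List[Sample],
    num_cycles: int = 2,
) -> Tuple[Any, List[Sample]]:
    """Alternate between answer generation, label filtering, and retraining."""
    kept: List[Sample] = []
    for _ in range(num_cycles):
        kept = []
        for sample in dataset:
            # Generation: the current model proposes an answer.
            answer = generate(model, sample["video"], sample["question"])
            # Verification: the label acts as weak supervision.
            if contains_label(answer, sample["label"]):
                kept.append({**sample, "answer": answer})
        # Fine-tuning: retrain on the answers that passed the filter.
        model = finetune(model, kept)
    return model, kept


# Toy stand-ins so the sketch runs without a real LVLM (hypothetical).
def toy_generate(model: Dict[str, str], video: str, question: str) -> str:
    # The "model" is a dict of memorised answers; unseen videos get one fixed guess.
    return model.get(video, "The person is diving into a pool.")


def toy_finetune(model: Dict[str, str], examples: List[Sample]) -> Dict[str, str]:
    # "Training" here just memorises the verified answers.
    updated = dict(model)
    for ex in examples:
        updated[ex["video"]] = ex["answer"]
    return updated


dataset = [
    {"video": "v1", "question": "What is happening?", "label": "diving"},
    {"video": "v2", "question": "What is happening?", "label": "juggling"},
]

if __name__ == "__main__":
    final_model, kept = video_star_cycle(toy_generate, toy_finetune, {}, dataset)
    print([sample["video"] for sample in kept])
```

In the toy run, the fixed guess mentions "diving" but not "juggling", so only the first sample passes the filter and is used for retraining. With a real model, `generate` and `finetune` would wrap LVLM inference and an instruction-tuning step.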

VideoSAVi: Self-Aligned Video Language Models without Human Supervision

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Video Instruction Tuning With Synthetic Data

STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

V-STaR: Training Verifiers for Self-Taught Reasoners

SVIT: Scaling up Visual Instruction Tuning

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models

Vision-Language Instruction Tuning: A Review and Analysis

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models

Aligning Large Multi-Modal Model with Robust Instruction Tuning

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models

Distilling Vision-Language Models on Millions of Videos

Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback

Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks

Reconstructive Visual Instruction Tuning

SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

Calibrated Self-Rewarding Vision Language Models