VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding

Ahmad Mahmood, Ashmal Vayani, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

2024-03-25

Abstract: Recent studies have demonstrated the effectiveness of Large Language Models (LLMs) as reasoning modules that can deconstruct complex tasks into more manageable sub-tasks, particularly when applied to visual reasoning tasks for images. In contrast, this paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of LLMs. Ours is a novel approach to extend the utility of LLMs in the context of video tasks, leveraging their capacity to generalize from minimal input and output demonstrations within a contextual framework. By presenting LLMs with pairs of instructions and their corresponding high-level programs, we harness their contextual learning capabilities to generate executable visual programs for video understanding. To enhance the programs' accuracy and robustness, we implement two important strategies. Firstly, we employ a feedback-generation approach, powered by GPT-3.5, to rectify errors in programs utilizing unsupported functions. Secondly, taking motivation from recent works on self-refinement of LLM outputs, we introduce an iterative procedure for improving the quality of the in-context examples by aligning the initial outputs to the outputs that would have been generated had the LLM not been bound by the structure of the in-context examples. Our results on several video-specific tasks, including visual QA, video anticipation, pose estimation, and multi-video QA, illustrate the efficacy of these enhancements in improving the performance of visual programming approaches for video tasks.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper proposes VURF, a general-purpose video understanding and reasoning framework that uses large language models (LLMs) as reasoning modules for video tasks. VURF extends the use of LLMs to video by decomposing complex video understanding tasks into subtasks that can be solved by dedicated computer vision models. Given pairs of instructions and their corresponding high-level programs as in-context examples, the LLM generates an executable visual program for a new instruction. To improve program accuracy and robustness, the paper implements two strategies. First, a feedback-generation step powered by GPT-3.5 corrects errors in programs that call unsupported functions. Second, inspired by recent work on self-refinement of LLM outputs, an iterative procedure improves the in-context examples by aligning the initial outputs with the outputs the LLM would have produced had it not been bound by the structure of those examples. Experimental results show that these enhancements improve the performance of visual programming approaches on video tasks, including visual question answering, video anticipation, pose estimation, and multi-video question answering.
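The first strategy, feeding errors about unsupported functions back to the LLM, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the module names in SUPPORTED, the program syntax, the prompt wording, and the fake_llm stub that stands in for a real model call. None of it is taken from the paper's implementation.

```python
import re

# Hypothetical module names; the paper's actual module set is not listed here.
SUPPORTED = {"DETECT", "TRACK", "VQA", "RESULT"}

# Matches the function name in assignments such as "OBJ0 = DETECT(...)".
CALL = re.compile(r"=\s*([A-Za-z_]\w*)\s*\(")


def build_prompt(examples, instruction):
    """Format (instruction, program) pairs as in-context demonstrations."""
    parts = [f"Instruction: {ins}\nProgram:\n{prog}" for ins, prog in examples]
    parts.append(f"Instruction: {instruction}\nProgram:")
    return "\n\n".join(parts)


def unsupported_calls(program):
    """Return the sorted function names in the program that are not supported."""
    return sorted({name for name in CALL.findall(program) if name not in SUPPORTED})


def generate_with_feedback(llm, examples, instruction, max_rounds=3):
    """Generate a program, then ask the LLM to repair unsupported calls."""
    prompt = build_prompt(examples, instruction)
    program = llm(prompt)
    for _ in range(max_rounds):
        bad = unsupported_calls(program)
        if not bad:
            break
        feedback = (
            f"The program calls unsupported functions {bad}. "
            f"Rewrite it using only: {sorted(SUPPORTED)}."
        )
        program = llm(f"{prompt}\n{program}\n\nFeedback: {feedback}\nProgram:")
    return program


def fake_llm(prompt):
    """Stand-in for a real LLM: wrong on the first try, fixed after feedback."""
    if "Feedback:" in prompt:
        return "OBJ0 = DETECT(video=VIDEO, object='dog')\nANSWER = RESULT(var=OBJ0)"
    return "OBJ0 = FIND(video=VIDEO, object='dog')\nANSWER = RESULT(var=OBJ0)"


EXAMPLES = [
    ("Is there a cat in the video?",
     "OBJ0 = DETECT(video=VIDEO, object='cat')\nANSWER = RESULT(var=OBJ0)"),
]
fixed_program = generate_with_feedback(fake_llm, EXAMPLES, "Is there a dog in the video?")
```

With the stub, the first draft calls the unsupported FIND and the repaired draft uses DETECT. In a real system fake_llm would be replaced by an actual model call, and the check against SUPPORTED would reflect whatever modules the interpreter really exposes.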

Look, Remember and Reason: Grounded reasoning in videos with language models

How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

Reframe Anything: LLM Agent for Open World Video Reframing

Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios

Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Smart Vision-Language Reasoners

Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

ViLLa: Video Reasoning Segmentation with Large Language Model

Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Prompting Large Language Models with Fine-Grained Visual Relations from Scene Graph for Visual Question Answering

Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events

Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception