Abstract: Complex video queries can be answered by decomposing them into modular subtasks. However, existing video data management systems assume the existence of predefined modules for each subtask. We introduce VOCAL-UDF, a novel self-enhancing system that supports compositional queries over videos without the need for predefined modules. VOCAL-UDF automatically identifies and constructs missing modules and encapsulates them as user-defined functions (UDFs), thus expanding its querying capabilities. To achieve this, we formulate a unified UDF model that leverages large language models (LLMs) to aid in new UDF generation. VOCAL-UDF handles a wide range of concepts by supporting both program-based UDFs (i.e., Python functions generated by LLMs) and distilled-model UDFs (lightweight vision models distilled from strong pretrained models). To resolve the inherent ambiguity in user intent, VOCAL-UDF generates multiple candidate UDFs and uses active learning to efficiently select the best one. With the self-enhancing capability, VOCAL-UDF significantly improves query performance across three video datasets.
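
As a rough illustration of the "identify missing modules" step described in the abstract, the sketch below treats a compositional query as a list of named concepts and checks which of them have no implementation in a UDF registry. The registry, the concept names, the bounding-box layout, and the `find_missing_udfs` helper are all hypothetical; none of them is taken from VOCAL-UDF itself.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a bounding box is an (x1, y1, x2, y2) tuple in pixel coordinates.
Box = Tuple[float, float, float, float]


def left_of(a: Box, b: Box) -> bool:
    """Return True if box a lies entirely to the left of box b."""
    return a[2] < b[0]


# Hypothetical registry mapping concept names to UDFs that already exist.
REGISTRY: Dict[str, Callable[..., bool]] = {"left_of": left_of}


def find_missing_udfs(
    query_concepts: List[str], registry: Dict[str, Callable[..., bool]]
) -> List[str]:
    """Return the concepts a query needs that have no UDF yet, in query order."""
    return [concept for concept in query_concepts if concept not in registry]


# A query that mentions one known concept and two unseen ones.
missing = find_missing_udfs(["left_of", "holding", "red"], REGISTRY)
print(missing)
```

In a system of this kind, the concepts returned here would be the ones handed to the UDF generation step.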

What problem does this paper attempt to address?

The paper addresses the limitations of existing video data management systems (VDBMSs) when handling complex video queries. Existing systems rely on predefined modules to perform each subtask, and these modules cannot cover every possible query requirement. For queries over compositional events in particular, such systems struggle to identify and handle complex spatio-temporal relationships, object properties, and their interactions. To address this, the authors propose a new self-enhancing video data management system, **VOCAL-UDF**. Its main objectives are:

1. **Support compositional queries without predefined modules**: VOCAL-UDF automatically identifies and constructs missing modules and encapsulates them as user-defined functions (UDFs), thereby expanding its query capabilities.
2. **Generate new UDFs using large language models (LLMs)**: New UDFs are generated automatically through LLMs to handle a variety of semantic concepts, in the form of program-based UDFs and distilled-model UDFs.
3. **Handle the ambiguity of user intent**: Active learning is used to select the UDF implementation that best matches the user's intent.
4. **A unified UDF model**: A unified UDF model enables LLMs to generate structured UDFs, simplifying the compilation process.

### Main contributions

- Propose a unified UDF scheme for objects, relationships, and properties in videos.
- Propose a self-enhancing framework that uses LLMs to automatically expand the query capabilities of VOCAL-UDF. The framework parses natural-language queries and turns unseen semantic concepts into program-based or distilled-model UDFs.
- Develop a method to manage both the ambiguity of semantic concepts and the generation errors of LLMs: generate diverse candidate UDFs, then efficiently determine the implementation that best matches the user's intent.
- Evaluate VOCAL-UDF on video datasets from three different domains, demonstrating significant improvements in F1 score, especially with respect to the automatic selection, implementation, and execution of generated UDFs.

Through these innovations, VOCAL-UDF significantly improves the ability to handle complex video queries, especially where fine-grained object classes and subjective concepts need to be identified and processed.
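
The candidate-selection idea above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a disagreement-based (query-by-committee style) labeling strategy, since the summary does not say which active learning strategy VOCAL-UDF uses. The `make_near` and `select_udf` names, the pixel thresholds, and the bounding-box layout are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) layout
Pair = Tuple[Box, Box]
Candidate = Callable[[Box, Box], bool]


def make_near(threshold: float) -> Candidate:
    """Build one hypothetical reading of 'near': box centers closer than threshold."""
    def near(a: Box, b: Box) -> bool:
        ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
        bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
        return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 < threshold
    return near


def select_udf(
    candidates: Sequence[Candidate],
    pool: Sequence[Pair],
    oracle: Callable[[Pair], bool],
    budget: int,
) -> int:
    """Return the index of the candidate that agrees most with the user's labels."""
    scores: List[int] = [0] * len(candidates)
    unlabeled = list(pool)

    def disagreement(sample: Pair) -> int:
        # Size of the minority vote: 0 means every candidate gives the same answer.
        votes = sum(cand(*sample) for cand in candidates)
        return min(votes, len(candidates) - votes)

    for _ in range(budget):
        if not unlabeled:
            break
        sample = max(unlabeled, key=disagreement)
        if disagreement(sample) == 0:
            break  # remaining samples cannot tell the candidates apart
        unlabeled.remove(sample)
        label = oracle(sample)  # one user interaction
        for i, cand in enumerate(candidates):
            scores[i] += int(cand(*sample) == label)
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy pool: a fixed box paired with boxes 5, 30, 100, and 300 pixels away.
anchor = (0.0, 0.0, 10.0, 10.0)
pool = [(anchor, (d, 0.0, d + 10.0, 10.0)) for d in (5.0, 30.0, 100.0, 300.0)]
candidates = [make_near(t) for t in (10.0, 50.0, 200.0)]
user_intent = make_near(60.0)  # stands in for the human labeler
best = select_udf(candidates, pool, lambda s: user_intent(*s), budget=2)
print(best)
```

With a budget of two labels, the sketch only asks about the two pairs on which the candidates disagree (30 and 100 pixels apart) and picks the 50-pixel candidate (index 1), the one closest to the simulated user's intent. Samples on which all candidates agree are never sent for labeling, which is what keeps the number of user interactions small.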

Self-Enhancing Video Data Management System for Compositional Events with Large Language Models [Technical Report]
