Self-Enhancing Video Data Management System for Compositional Events with Large Language Models [Technical Report]

Enhao Zhang, Nicole Sullivan, Brandon Haynes, Ranjay Krishna, Magdalena Balazinska
2024-08-05
Abstract: Complex video queries can be answered by decomposing them into modular subtasks. However, existing video data management systems assume the existence of predefined modules for each subtask. We introduce VOCAL-UDF, a novel self-enhancing system that supports compositional queries over videos without the need for predefined modules. VOCAL-UDF automatically identifies and constructs missing modules and encapsulates them as user-defined functions (UDFs), thus expanding its querying capabilities. To achieve this, we formulate a unified UDF model that leverages large language models (LLMs) to aid in new UDF generation. VOCAL-UDF handles a wide range of concepts by supporting both program-based UDFs (i.e., Python functions generated by LLMs) and distilled-model UDFs (lightweight vision models distilled from strong pretrained models). To resolve the inherent ambiguity in user intent, VOCAL-UDF generates multiple candidate UDFs and uses active learning to efficiently select the best one. With the self-enhancing capability, VOCAL-UDF significantly improves query performance across three video datasets.
Databases
What problem does this paper attempt to address?
This paper addresses the limitations of existing video data management systems (VDBMSs) in handling complex video queries. Existing systems rely on predefined modules to perform each subtask, and these modules cannot cover every possible query requirement. For queries over compositional events in particular, such systems have difficulty identifying and handling complex spatio-temporal relationships, object properties, and their interactions. To address this, the authors propose a new self-enhancing video data management system, **VOCAL-UDF**.

The main objectives of VOCAL-UDF are:

1. **Support compositional queries without predefined modules**: VOCAL-UDF automatically identifies and constructs missing modules and encapsulates them as user-defined functions (UDFs), thereby expanding its query capabilities.
2. **Generate new UDFs using large language models (LLMs)**: LLMs are used to generate new UDFs automatically for a range of semantic concepts, as either program-based UDFs (Python functions generated by LLMs) or distilled-model UDFs (lightweight vision models distilled from strong pretrained models).
3. **Handle the ambiguity of user intent**: Active learning is used to select the UDF implementation that best matches the user's intent.
4. **A unified UDF model**: A unified UDF model enables LLMs to generate structured UDFs, simplifying the compilation process.

### Main contributions

- Propose a unified UDF scheme for objects, relationships, and properties in videos.
- Propose a self-construction framework that uses LLMs to automatically expand the query capabilities of VOCAL-UDF. The framework parses natural-language queries and transforms unseen semantic concepts into program-based or distilled-model UDFs.
- Develop a method to manage both the ambiguity of semantic concepts and the generation errors of LLMs by generating diverse candidate UDFs and efficiently determining the implementation that best matches the user's intent.
- Evaluate VOCAL-UDF on video datasets from three different domains, demonstrating significant improvements in F1 score, in particular through the automatic selection, implementation, and execution of generated UDFs.

Through these innovations, VOCAL-UDF significantly improves the ability to handle complex video queries, especially in cases where fine-grained object classes and subjective concepts need to be identified and processed.
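To make the idea of choosing among candidate UDFs more concrete, the following is a minimal sketch of what selection with a small labeling budget might look like. It is not the paper's algorithm: the three candidate functions, the pixel thresholds, the disagreement-based sampling rule, the use of accuracy as the score, and all names (`near_by_center`, `select_udf`, `oracle`, and so on) are assumptions made purely for illustration. Each candidate is a different plausible reading of an ambiguous relationship such as "near" between two bounding boxes, and the user is asked only about the samples where candidates disagree.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format
Sample = Tuple[Box, Box]
UDF = Callable[[Box, Box], bool]


def near_by_center(a: Box, b: Box) -> bool:
    """Candidate 1: box centers are less than 50 pixels apart."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 < 50


def near_by_gap(a: Box, b: Box) -> bool:
    """Candidate 2: the gap between box edges is under 10 pixels."""
    gap_x = max(a[0] - b[2], b[0] - a[2], 0)
    gap_y = max(a[1] - b[3], b[1] - a[3], 0)
    return max(gap_x, gap_y) < 10


def near_by_overlap(a: Box, b: Box) -> bool:
    """Candidate 3: the boxes must actually overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def disagreement(udfs: Dict[str, UDF], sample: Sample) -> int:
    """Size of the minority vote; 0 means all candidates agree."""
    votes = sum(udf(*sample) for udf in udfs.values())
    return min(votes, len(udfs) - votes)


def select_udf(udfs: Dict[str, UDF], pool: List[Sample],
               oracle: Callable[[Sample], bool], budget: int) -> str:
    """Label the most contested samples, then return the best-scoring candidate."""
    correct = {name: 0 for name in udfs}
    unlabeled = list(pool)
    for _ in range(min(budget, len(unlabeled))):
        # Ask the user about the sample the candidates disagree on most.
        sample = max(unlabeled, key=lambda s: disagreement(udfs, s))
        unlabeled.remove(sample)
        label = oracle(sample)
        for name, udf in udfs.items():
            correct[name] += int(udf(*sample) == label)
    return max(correct, key=correct.get)


candidates = {
    "near_by_center": near_by_center,
    "near_by_gap": near_by_gap,
    "near_by_overlap": near_by_overlap,
}
ref = (0, 0, 10, 10)
pool = [
    (ref, (12, 0, 22, 10)),       # small gap, no overlap
    (ref, (40, 0, 50, 10)),       # centers close, edges far apart
    (ref, (5, 5, 15, 15)),        # overlapping
    (ref, (200, 200, 210, 210)),  # far away
]
# Stand-in for the user: "near" means touching or almost touching.
user_labels = dict(zip(pool, [True, False, True, False]))
best = select_udf(candidates, pool, user_labels.__getitem__, budget=2)
print(best)
```

With a budget of two labels, the sketch queries only the first two samples, since the candidates agree on the other two, and that is enough to separate the candidates in this toy pool. A real system would presumably need a more careful sampling strategy and a metric such as F1 on imbalanced data.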