Shan Yu, Zhenting Zhu, Yu Chen, Hanchen Xu, Pengzhan Zhao, Yang Wang, Arthi Padmanabhan, Hugo Latapie, Harry Xu
Abstract: Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., humans, animals, and cars), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a frontend (a Python variant with constructs that make it easy for users to express video objects and their interactions) as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which has been productized in Cisco as part of its DeepVision framework.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the difficulty of expressing and executing complex queries in modern video analytics. With the spread of surveillance cameras and online video platforms, large volumes of video data are being generated, and intelligent analysis of this data is essential for applications such as enhanced security in smart cities, optimized traffic management, and autonomous driving. Video querying is the core of video analytics: users search for objects or events of specific interest, such as traffic violations or suspicious activities.
Current methods for handling video queries mainly fall into three groups: manually constructed pipelines, SQL-like languages, and multi-modal language models. Each has limitations. Manually constructing pipelines is labor-intensive and error-prone, SQL-like frameworks are not adept at handling object-based video queries, and multi-modal language models are unsuitable for real-time or time-sensitive applications.
The main observation of the paper is that video queries focus on objects in the videos (such as people, animals, and vehicles) and their spatial and temporal interactions. Existing technologies lack object-based abstractions, which makes complex queries difficult to describe and optimize. The key insight of the paper is that video objects share essential similarities with objects in traditional object-oriented programming languages. A query framework built around video objects can therefore simplify the writing of complex queries and allow optimizations at the object level to improve query performance.
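To make the analogy between video objects and programming-language objects concrete, the following is a minimal plain-Python sketch. It is not VQPy's actual API: the names `VideoObject`, `Vehicle`, `observe`, `speed`, and `speeding` are hypothetical and chosen only for illustration, and speed is measured in pixels per frame as a simplifying assumption.

```python
from dataclasses import dataclass, field
from math import hypot


@dataclass
class VideoObject:
    """Hypothetical tracked object: `centers` holds (frame_index, x, y) tuples."""
    track_id: int
    class_name: str
    centers: list = field(default_factory=list)

    def observe(self, frame_index, x, y):
        # Record where the object's center was seen in a given frame.
        self.centers.append((frame_index, x, y))


class Vehicle(VideoObject):
    """Inherits the generic tracking state and adds a vehicle-specific property."""

    @property
    def speed(self):
        # Pixels per frame between the two most recent observations.
        if len(self.centers) < 2:
            return 0.0
        (f0, x0, y0), (f1, x1, y1) = self.centers[-2:]
        if f1 == f0:
            return 0.0
        return hypot(x1 - x0, y1 - y0) / (f1 - f0)


def speeding(objects, limit):
    """A query expressed over objects: track IDs of vehicles faster than `limit`."""
    return [o.track_id for o in objects
            if isinstance(o, Vehicle) and o.speed > limit]
```

In this sketch a temporal property such as speed lives on the object type itself, so a query reads as a filter over objects rather than as a hand-built chain of detection and tracking steps.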
To this end, the paper introduces VQPy, a variant of Python designed specifically for expressing video objects and their interactions in a user-friendly manner. The VQPy frontend uses an object-oriented approach to represent video objects and interactions, while the backend is based on an object-centric data model and automatically constructs and optimizes pipelines. Furthermore, VQPy provides an extensible optimization framework into which various query optimizations can be easily integrated.
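As an illustration of what automatic pipeline construction with object-level optimization could look like, here is a small plain-Python sketch. It is an assumption for illustration only and does not describe VQPy's backend: the function `build_pipeline`, the per-predicate cost values, and the dictionary-based objects are all hypothetical. The sketch orders predicates from cheapest to most expensive and stops evaluating an object as soon as one predicate fails, so costly checks run on fewer objects.

```python
def build_pipeline(predicates):
    """Build a filter from (name, cost, fn) triples (all hypothetical).

    Predicates are applied cheapest-first with short-circuiting.
    Returns the pipeline function and a dict counting calls per predicate.
    """
    ordered = sorted(predicates, key=lambda p: p[1])
    calls = {name: 0 for name, _, _ in ordered}

    def run(objects):
        kept = []
        for obj in objects:
            for name, _, fn in ordered:
                calls[name] += 1
                if not fn(obj):
                    break  # skip the remaining, more expensive predicates
            else:
                kept.append(obj)  # every predicate passed
        return kept

    return run, calls


# Example: an expensive color check is declared first, but runs last.
run, calls = build_pipeline([
    ("color_model", 50, lambda o: o["color"] == "red"),
    ("is_car", 1, lambda o: o["cls"] == "car"),
])
```

The user declares predicates in any order; the reordering happens inside the constructed pipeline, which is the general shape of an optimization that a backend can apply without changes to the query itself.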
The paper demonstrates the effectiveness of VQPy through evaluations on 14 queries and 5 real-world surveillance video stream datasets, showing an average speedup of 10 times over existing systems without sacrificing accuracy. VQPy is open-source and has been productized in Cisco's DeepVision framework.