Effective Video Abnormal Event Detection by Learning A Consistency-Aware High-Level Feature Extractor

Guang Yu,Siqi Wang,Zhiping Cai,Xinwang Liu,Chengkun Wu
DOI: https://doi.org/10.1145/3503161.3547944
2022-01-01
Abstract:With pure normal training videos, video abnormal event detection (VAD) aims to build a normality model, and then detect abnormal events that deviate from this model. Despite of some progress, existing VAD methods typically train the normality model by a low-level learning objective (e.g. pixel-wise reconstruction/prediction), which often overlooks the high-level semantics in videos. To better exploit high-level semantics for VAD, we propose a novel paradigm that performs VAD by learning a Consistency-Aware high-level Feature Extractor (CAFE). Specifically, with a pre-trained deep neural network (DNN) as teacher network, we first feed raw video events into the teacher network and extract the outputs of multiple hidden layers as their high-level features, which contain rich high-level semantics. Guided by high-level features extracted from normal training videos, we train a student network to be the high-level feature extractor of normal events, so as to explicitly consider high-level semantics in training. For inference, a video event can be viewed as normal if the student extractor produces similar high-level features to the teacher network. Second, based on the fact that consecutive video frames usually enjoy minor differences, we propose a consistency-aware scheme that requires high-level features extracted from neighboring frames to be consistent. Our consistency-aware scheme not only encourages the student extractor to ignore low-level differences and capture more high-level semantics, but also enables better anomaly scoring. Last, we also design a generic framework that can bridge high-level and low-level learning in VAD to further ameliorate VAD performance. By flexibly embedding one or more low-level learning objectives into CAFE, the framework makes it possible to combine the strengths of both high-level and low-level learning. The proposed method attains state-of-the-art results on commonly-used benchmark datasets.
What problem does this paper attempt to address?