Abstract: Video anomaly detection (VAD) has received increasing attention due to its potential applications. Its currently dominant tasks focus on detecting anomalies online at the frame level, which can be roughly interpreted as binary or multi-class event classification. However, such a setup, which builds relationships between complicated anomalous events and single labels, e.g., "vandalism", is superficial, since single labels are insufficient to characterize anomalous events. In reality, users tend to search for a specific video rather than a series of approximate videos. Therefore, retrieving anomalous events using detailed descriptions is practical and valuable, but little research focuses on this. In this context, we propose a novel task called Video Anomaly Retrieval (VAR), which aims to pragmatically retrieve relevant anomalous videos via cross-modal queries, e.g., language descriptions and synchronous audio. Unlike current video retrieval, where videos are assumed to be temporally well-trimmed and short, VAR is devised to retrieve long untrimmed videos that may be only partially relevant to the given query. To achieve this, we present two large-scale VAR benchmarks, UCFCrime-AR and XDViolence-AR, constructed on top of prevalent anomaly datasets. Meanwhile, we design a model called Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we propose anomaly-led sampling to focus on key segments in long untrimmed videos. Then, we introduce an efficient pretext task to enhance semantic associations between fine-grained video-text representations. In addition, we leverage two complementary alignments to further match cross-modal content. Experimental results on the two benchmarks reveal the challenges of the VAR task and also demonstrate the advantages of our tailored method. Captions are publicly released at https://github.com/Roc-Ng/VAR.

Skimming and Scanning for Efficient Action Recognition in Untrimmed Videos

Skimming and Scanning for Untrimmed Video Action Recognition

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition.

Agent-based Video Trimming

Annotation-Efficient Untrimmed Video Action Recognition

TFRS: A task-level feature rectification and separation method for few-shot video action recognition

Efficient Action Detection in Untrimmed Videos via Multi-Task Learning

Recognizing Video Activities in the Wild Via View-to-Scene Joint Learning

A Joint Model for Action Localization and Classification in Untrimmed Video with Visual Attention

View While Moving: Efficient Video Recognition in Long-untrimmed Videos

Efficient Video Action Detection with Token Dropout and Context Refinement.

Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

Deep Learning-Based Action Detection in Untrimmed Videos: A Survey

Action Machine: Rethinking Action Recognition in Trimmed Videos

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Watching a Small Portion Could Be As Good As Watching All: Towards Efficient Video Classification.

Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model

SVFormer: A Direct Training Spiking Transformer for Efficient Video Action Recognition

TwinNet: Twin Structured Knowledge Transfer Network for Weakly Supervised Action Localization

Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language

Video Action Recognition with Attentive Semantic Units