Abstract: Multimedia event detection aims to precisely retrieve videos that contain complex semantic events from a large pool. This work addresses the task under a zero-shot setting, where only brief event-specific textual information (such as event names or a few descriptive sentences) is known and no positive video example is provided. Mainstream approaches to this task are based on middle-level semantic concepts and adopt meticulously crafted concept banks (e.g., LSCOM). We argue that these concept banks are still inadequate in the face of video semantic complexity. Existing semantic concepts are essentially first-order, designed mainly for atomic objects, scenes, or human actions. This work advocates the use of high-order concepts (such as subject-predicate-object triplets or adjective-object pairs). The main contributions are two-fold. First, we harvest a comprehensive yet compact high-order concept library by distilling information from three large public datasets (MS-COCO, Visual Genome, and Kinetics-600), mainly related to visual relations and human-object interactions. Second, zero-shot events are often only briefly and partially described via textual input, and the resulting semantic ambiguity makes it challenging to find the most indicative high-order concepts. We thus design a novel query-expansion scheme that enriches ambiguous event-specific keywords by searching over either large common knowledge bases (e.g., WikiHow) or top-ranked webpages retrieved from modern search engines. This sets up a more faithful connection between zero-shot events and high-order concepts. To the best of our knowledge, this is the first work that strives for concept-based video search beyond first-order concepts. Extensive experiments have been conducted on several large video benchmarks (TRECVID 2013, TRECVID 2014, and ActivityNet-1.3). The evaluations clearly demonstrate the superiority of our constructed high-order concept library and its complementarity to existing concepts.

Detecting Semantic Concepts in Consumer Videos Using Audio

Semantic Concept Annotation Based on Audio PLSA Model

Robust Semantic Concept Detection in Large Video Collections

VIREO-374: LSCOM Semantic Concept Detectors Using Local Keypoint Features

Video Semantic Concept Detection Using Multi-Modality Subspace Correlation Propagation

Semantic Video Search by Exploiting Large-Scale Visual Concepts

Exploiting Concept Association to Boost Multimedia Semantic Concept Detection

Zero-Shot Video Event Detection With High-Order Semantic Concept Discovery and Matching

Audio-Visual Segmentation with Semantics

Object Segmentation with Audio Context

Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

Video Data Mining: Semantic Indexing and Event Detection from the Association Perspective

Audio Keywords Generation for Sports Video Analysis

Fast and Accurate Content-Based Semantic Search in 100M Internet Videos

Event-Driven Semantic Concept Discovery by Exploiting Weakly Tagged Internet Images

Caption-aided Speech Detection in Videos

Multi-modal interview concept detection for rushes exploitation

QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition

A Fusion Scheme of Visual and Auditory Modalities for Event Detection in Sports Video

Semantic Concept Discovery for Large-Scale Zero-Shot Event Detection