Abstract: Multimedia event detection aims to precisely retrieve videos that contain complex semantic events from a large pool. This work addresses the task under a zero-shot setting, where only brief event-specific textual information (such as event names or a few descriptive sentences) is known and no positive video example is provided. Mainstream approaches to this task are based on middle-level semantic concepts and adopt meticulously crafted concept banks (e.g., LSCOM). We argue that these concept banks are still inadequate in the face of video semantic complexity. Existing semantic concepts are essentially first-order, designed mainly for atomic objects, scenes, or human actions. This work advocates the use of high-order concepts (such as subject-predicate-object triplets or adjective-object pairs). The main contributions are two-fold. First, we harvest a comprehensive yet compact high-order concept library by distilling information from three large public datasets (MS-COCO, Visual Genome, and Kinetics-600), mainly related to visual relations and human-object interactions. Second, zero-shot events are often only briefly and partially described via textual input, and the resulting semantic ambiguity makes it challenging to identify the most indicative high-order concepts. We thus design a novel query-expansion scheme that enriches ambiguous event-specific keywords by searching over either large common knowledge bases (e.g., WikiHow) or top-ranked webpages retrieved from modern search engines. This establishes a more faithful connection between zero-shot events and high-order concepts. To the best of our knowledge, this is the first work that strives for concept-based video search beyond first-order concepts. Extensive experiments have been conducted on several large video benchmarks (TRECVID 2013, TRECVID 2014, and ActivityNet-1.3). The evaluations clearly demonstrate the superiority of our constructed high-order concept library and its complementarity to existing concepts.

Automatic annotation and retrieval for videos

Text Semantics Based Automatic Summarization for Chinese Videos

Applying Semantic Association To Support Content-Based Video Retrieval

An Efficient Approach Based on Image Pixel and Semantic Features Towards Video Retrieval

Robust Semantic Video Indexing by Harvesting Web Images

Exploiting Semantic And Visual Context For Effective Video Annotation

An integrated semantic-based approach in concept based video retrieval

Automatic Video Annotation Through Search and Mining

Fast And Accurate Content-Based Semantic Search In 100m Internet Videos

Semi-automatic Video Content Annotation

Semantic Annotation for Complex Video Street Views Based on 2D–3D Multi-Feature Fusion and Aggregated Boosting Decision Forests

ShotTagger: tag location for internet videos

Robust Semantic Concept Detection in Large Video Collections

Semantic Video Search by Exploiting Large-Scale Visual Concepts

A Tree-Based Paradigm for Content-Based Video Retrieval and Management

Video structural description technology for the new generation video surveillance systems

Video diver: generic video indexing with diverse features

Semantic Concept Learning Through Massive Internet Video Mining

Zero-Shot Video Event Detection With High-Order Semantic Concept Discovery and Matching

Event-Driven Semantic Concept Discovery by Exploiting Weakly Tagged Internet Images

Automatic Image Annotations by Mining Web Image Data