Abstract: In this work we deal with the problem of high-level event detection in video. Specifically, we study the challenging problems of i) learning to detect video events from solely a textual description of the event, without using any positive video examples, and ii) additionally exploiting very few positive training samples together with a small number of "related" videos. For learning only from an event's textual description, we first identify a general learning framework and then study the impact of different design choices for various stages of this framework. For additionally learning from example videos, when true positive training samples are scarce, we employ an extension of the Support Vector Machine that allows us to exploit "related" event videos by automatically introducing different weights for subsets of the videos in the overall training set. Experimental evaluations performed on the large-scale TRECVID MED 2014 video dataset provide insight on the effectiveness of the proposed methods.

What problem does this paper attempt to address?

This paper addresses two key problems in video event detection:

1. **Learning to detect video events from text descriptions alone, without any positive video examples.** The paper explores how to train a video event detector using only the textual description of an event (its title, a free-form text explanation, and visual and audio cues), with no video samples labeled as positive. This is challenging because traditional supervised learning methods usually rely on a large amount of labeled data.
2. **Learning from very few positive training samples plus a small number of "related" videos.** When only a handful of positive video samples are available, the question is how to combine them with videos that are closely related to the target event but do not fully meet the criteria for a positive example. The authors employ an extension of the Support Vector Machine (Relevance Degree SVM, RDSVM) that automatically assigns different weights to different subsets of the training videos, so that the "related" videos can be put to better use.

### Overview of Solutions

- **Learning Only from Text Descriptions**: A general learning framework is identified, and the impact of different design choices at each stage of this framework on performance is studied. An event detector is constructed by calculating the Explicit Semantic Analysis (ESA) distance between the event's text description and a set of concepts.
- **Learning by Combining a Small Number of Positive Samples and Related Videos**: With RDSVM, the "related" videos are treated as weighted positive or negative samples, which improves the generalization ability of the model when positive samples are scarce. The weight given to the related samples is tuned through cross-validation.
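The idea of down-weighting "related" videos can be illustrated with a minimal sketch. This is not the RDSVM formulation from the paper: it is a plain linear SVM trained by subgradient descent in which every sample carries its own weight on the hinge loss, and a single hold-out set stands in for cross-validation. The 2-D features, the candidate weights, and all function names below are hypothetical and chosen only for illustration.

```python
import numpy as np

def train_weighted_linear_svm(X, y, sample_weight, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i s_i * max(0, 1 - y_i*f(x_i))."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = (margins < 1).astype(float)    # 1.0 where the hinge loss is non-zero
        coef = C * sample_weight * active * y   # per-sample pull on w and b
        w -= lr * (w - coef @ X)
        b -= lr * (-coef.sum())
    return w, b

def predict(w, b, X):
    """Return +1 / -1 labels from the sign of the decision value."""
    return np.where(X @ w + b >= 0, 1, -1)

# Hypothetical toy features: many negatives, two true positives, three "related" videos.
negatives = np.array([[-2.0, -2.0], [-2.5, -1.5], [-1.5, -2.5],
                      [-2.0, -3.0], [-3.0, -2.0], [-1.8, -2.2]])
positives = np.array([[2.0, 2.0], [2.5, 1.8]])
related = np.array([[1.2, 1.6], [1.5, 1.0], [0.8, 1.4]])

X_train = np.vstack([negatives, positives, related])
# Assumption: related videos are used as positives here, only with a reduced weight.
y_train = np.array([-1] * len(negatives) + [1] * len(positives) + [1] * len(related))

# Hypothetical hold-out set containing only clear positives and negatives.
X_val = np.array([[2.2, 2.1], [-2.1, -1.9], [1.9, 2.4], [-2.6, -2.2]])
y_val = np.array([1, -1, 1, -1])

def validation_accuracy(related_weight):
    """Train with the given weight on related samples and score on the hold-out set."""
    n_full = len(negatives) + len(positives)
    weights = np.concatenate([np.ones(n_full), np.full(len(related), related_weight)])
    w, b = train_weighted_linear_svm(X_train, y_train, weights)
    return float(np.mean(predict(w, b, X_val) == y_val))

candidates = [0.0, 0.25, 0.5, 1.0]   # 0.0 ignores related videos, 1.0 treats them as positives
accuracies = {r: validation_accuracy(r) for r in candidates}
best_weight = max(candidates, key=accuracies.get)
```

On data this cleanly separable every candidate weight is likely to do equally well, so the selection step matters only on harder data; the point of the sketch is the mechanism, where one scalar controls how strongly the related videos shape the decision boundary.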
### Experimental Verification

The paper conducts experiments on the large-scale TRECVID MED 2014 video dataset to evaluate the proposed methods. The results provide insight into the effectiveness of these methods for complex event detection, particularly when positive samples are scarce. Through the two methods above, the paper aims to narrow the gap between low-level audiovisual features and semantic-level event definitions, thereby improving the accuracy and robustness of video event detection.
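How a textual event description can be connected to per-video concept scores, with no positive video examples, can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: a bag-of-words cosine similarity stands in for the ESA measure, and the concept names, event text, and per-video concept detector scores are all hypothetical.

```python
import math
from collections import Counter

def bow(text):
    """Lowercased bag-of-words representation of a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_event_detector(event_text, concept_names, top_k=2):
    """Keep the top_k concepts most similar to the event text, with similarity as weight."""
    event_vec = bow(event_text)
    sims = {c: cosine(event_vec, bow(c)) for c in concept_names}
    top = sorted(sims, key=sims.get, reverse=True)[:top_k]
    return {c: sims[c] for c in top if sims[c] > 0}

def score_video(detector, concept_scores):
    """Weighted sum of the video's concept detector outputs for the selected concepts."""
    return sum(w * concept_scores.get(c, 0.0) for c, w in detector.items())

# Hypothetical event description and concept pool.
event_text = "birthday party people singing cake candles"
concepts = ["birthday cake", "dog", "people singing", "car engine"]
detector = build_event_detector(event_text, concepts)

# Hypothetical per-video concept detector outputs in [0, 1].
videos = {
    "vid_a": {"birthday cake": 0.9, "people singing": 0.7, "dog": 0.1},
    "vid_b": {"car engine": 0.8, "dog": 0.6},
}
ranking = sorted(videos, key=lambda v: score_video(detector, videos[v]), reverse=True)
```

Here the event text selects the concepts "birthday cake" and "people singing", so the video whose concept detectors fire on those concepts ranks first. A real system would replace the word-overlap similarity with a proper semantic relatedness measure and use a far larger concept pool.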

Learning to detect video events from zero or very few video examples
