Learning to detect video events from zero or very few video examples

Christos Tzelepis, Damianos Galanopoulos, Vasileios Mezaris, Ioannis Patras
DOI: https://doi.org/10.1016/j.imavis.2015.09.005
2015-11-25
Abstract: In this work we deal with the problem of high-level event detection in video. Specifically, we study the challenging problems of i) learning to detect video events from solely a textual description of the event, without using any positive video examples, and ii) additionally exploiting very few positive training samples together with a small number of "related" videos. For learning only from an event's textual description, we first identify a general learning framework and then study the impact of different design choices for various stages of this framework. For additionally learning from example videos, when true positive training samples are scarce, we employ an extension of the Support Vector Machine that allows us to exploit "related" event videos by automatically introducing different weights for subsets of the videos in the overall training set. Experimental evaluations performed on the large-scale TRECVID MED 2014 video dataset provide insight on the effectiveness of the proposed methods.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to solve two key problems in video event detection:

1. **Learning to detect video events only from the text descriptions of events, without using any positive video examples**:
   - The paper explores how to train a video event detector based only on the textual description of an event (such as its title, free-format text explanation, and visual and audio cues), without any video samples labeled as positive. This problem is challenging because traditional supervised learning methods usually rely on a large amount of labeled data.
2. **Learning with a small number of positive training samples and a small number of "related" videos**:
   - When only a very limited number of positive video samples is available, the question is how to combine them with videos that are closely related to the target event but do not fully meet the positive-class criteria. The authors employ an extended support vector machine (Relevance Degree SVM, RDSVM) that automatically assigns different weights to different subsets of videos, so as to make better use of these "related" videos.

### Overview of Solutions

- **Learning Only from Text Descriptions**:
  - A general learning framework is identified, and the impact of different design choices at each stage of this framework on performance is studied. An event detector is constructed by calculating the Explicit Semantic Analysis (ESA) distance between text descriptions and concepts.
- **Learning by Combining a Small Number of Positive Samples and Related Videos**:
  - Using the RDSVM algorithm, the "related" videos are treated as weighted positive or negative samples, which improves the generalization ability of the model when positive samples are scarce. The weight parameters of the related samples are optimized through cross-validation to achieve the best classification performance.
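The idea of letting "related" videos enter training as down-weighted samples can be illustrated with a small sketch. This is not the paper's RDSVM formulation: it is a plain linear SVM trained by batch subgradient descent, where each sample's hinge loss is scaled by a per-sample weight. The feature vectors, the `RELATED_WEIGHT` value, and all function names are hypothetical and chosen only for illustration.

```python
# Toy sketch: linear SVM with per-sample weights on the hinge loss.
# This is NOT the paper's RDSVM; it only illustrates how "related"
# videos can enter training as down-weighted samples.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_weighted_linear_svm(X, y, weights, C=1.0, epochs=500, lr=0.01):
    """Batch subgradient descent on
    0.5 * ||w||^2 + C * sum_i weights[i] * max(0, 1 - y_i * (w . x_i + b))."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the regulariser 0.5 * ||w||^2
        grad_b = 0.0
        for xi, yi, ci in zip(X, y, weights):
            if yi * (dot(w, xi) + b) < 1.0:  # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= C * ci * yi * xi[j]
                grad_b -= C * ci * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def decision(w, b, x):
    """Signed score; positive means 'event detected'."""
    return dot(w, x) + b

# Hypothetical 2-D video features (made up for illustration).
positives = [[2.0, 2.0], [2.5, 1.5]]
related = [[1.0, 1.5], [1.5, 1.0]]  # treated here as weak positives
negatives = [[-2.0, -2.0], [-1.5, -2.5], [-2.5, -1.0], [-1.0, -2.0]]

RELATED_WEIGHT = 0.5  # assumed value; in practice it would be tuned

X = positives + related + negatives
y = [1] * len(positives) + [1] * len(related) + [-1] * len(negatives)
weights = ([1.0] * len(positives)
           + [RELATED_WEIGHT] * len(related)
           + [1.0] * len(negatives))

w, b = train_weighted_linear_svm(X, y, weights)
print("score of a positive video:", decision(w, b, positives[0]))
print("score of a negative video:", decision(w, b, negatives[0]))
```

In a fuller version, the weight of the related samples (and whether they are labeled positive or negative) would be one more hyperparameter selected by cross-validation, as the summary describes.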
### Experimental Verification

The paper conducts experiments on the large-scale TRECVID MED 2014 video dataset to evaluate the proposed methods. The experimental results indicate that these methods are effective for complex event detection tasks, especially when positive samples are scarce. Through the two approaches above, the paper aims to narrow the gap between low-level audiovisual features and semantic-level event definitions, thereby improving the accuracy and robustness of video event detection.
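To make the text-only setting more concrete, here is a minimal sketch of how an event description can be turned into a detector over concept scores. It assumes each video already has a score per visual concept, and it uses word overlap (Jaccard similarity) as a crude stand-in for the ESA-based relatedness mentioned in the summary. The concept labels, video scores, and function names are hypothetical.

```python
import math

def tokenize(text):
    return set(text.lower().split())

def text_similarity(a, b):
    """Jaccard word overlap; a crude stand-in for an ESA-style measure."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def build_event_detector(event_text, concept_labels, top_k=2):
    """Keep the top_k concepts most similar to the event text;
    the detector is a vector of their similarities (zero elsewhere)."""
    sims = [text_similarity(event_text, label) for label in concept_labels]
    keep = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]
    return [sims[i] if i in keep else 0.0 for i in range(len(sims))]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return num / (nu * nv)

# Hypothetical concept pool and event description.
concept_labels = ["dog", "birthday cake", "car", "people singing"]
event_text = "birthday party with cake and people singing"

detector = build_event_detector(event_text, concept_labels, top_k=2)

# Hypothetical per-video concept detector outputs (one score per concept).
videos = {
    "video_a": [0.1, 0.9, 0.0, 0.8],
    "video_b": [0.9, 0.1, 0.7, 0.0],
}

# Rank videos by how well their concept scores match the event detector.
ranking = sorted(videos, key=lambda name: cosine(detector, videos[name]),
                 reverse=True)
print(ranking)
```

The choices made here (similarity measure, number of concepts kept, and the matching function between detector and video) correspond to the kind of design decisions the summary says the paper compares, although the specific options in this sketch are illustrative only.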