Abstract: We propose a visual event recognition framework for consumer videos by leveraging a large amount of loosely labeled web videos (e.g., from YouTube). Observing that consumer videos generally contain large intraclass variations within the same type of events, we first propose a new method, called Aligned Space-Time Pyramid Matching (ASTPM), to measure the distance between any two video clips. Second, we propose a new transfer learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), in order to 1) fuse the information from multiple pyramid levels and features (i.e., space-time features and static SIFT features) and 2) cope with the considerable variation in feature distributions between videos from two domains (i.e., the web video domain and the consumer video domain). For each pyramid level and each type of local feature, we first train a set of SVM classifiers on the combined training set from the two domains, using multiple base kernels of different kernel types and parameters; these classifiers are then fused with equal weights to obtain a prelearned average classifier. In A-MKL, for each event class we learn an adapted target classifier based on multiple base kernels and the prelearned average classifiers from this event class or from all the event classes by minimizing both the structural risk functional and the mismatch between the data distributions of the two domains. Extensive experiments demonstrate the effectiveness of our proposed framework, which requires only a small number of labeled consumer videos by leveraging web data. We also conduct an in-depth investigation of various aspects of the proposed method A-MKL, such as an analysis of the combination coefficients of the prelearned classifiers, the convergence of the learning algorithm, and the performance variation when using different proportions of labeled consumer videos. Moreover, we show that A-MKL using the prelearned classifiers from all the event classes leads to better performance than A-MKL using the prelearned classifiers only from each individual event class.
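
As a reading aid, the equal-weight fusion and the adapted classifier described in the abstract can be written out roughly as follows. The notation is an assumption introduced here for illustration and does not appear in the abstract: $M$ base kernels with feature maps $\phi_m$, $f_{p,m}$ the SVM classifier trained with the $m$-th base kernel for the $p$-th combination of pyramid level and feature type, $P$ prelearned average classifiers $f_p$, and combination coefficients $\beta_p$ and $d_m$. The second expression is only one plausible form consistent with the description, not a statement of the exact model.

```latex
% Equal-weight fusion of the M base-kernel SVMs into a prelearned average classifier
f_p(x) = \frac{1}{M} \sum_{m=1}^{M} f_{p,m}(x)

% Assumed form of the adapted target classifier: prelearned classifiers plus
% a multiple-kernel term, with \beta_p, d_m, w_m and b learned jointly
f^{T}(x) = \sum_{p=1}^{P} \beta_p \, f_p(x) + \sum_{m=1}^{M} d_m \, w_m^{\top} \phi_m(x) + b
```

Under this reading, the learned coefficients $\beta_p$ are the "combination coefficients of the prelearned classifiers" that the abstract says are analyzed in the experiments.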

Finding Event Videos Via Image Search Engine

Joint Searching and Grounding: Multi-Granularity Video Content Retrieval

A Novel Learning-Based Frame Pooling Method for Event Detection.

Webly-Supervised Video Recognition By Mutually Voting For Relevant Web Images And Web Video Frames

Web Video Event Recognition by Semantic Analysis from Ubiquitous Documents

Localizing Web Videos from Heterogeneous Images.

Leveraging Collective Wisdom for Web Video Retrieval Through Heterogeneous Community Discovery

Action and Event Recognition in Videos by Learning From Heterogeneous Web Sources.

Enhancing Video Event Recognition Using Automatically Constructed Semantic-Visual Knowledge Base.

Fusing Cross-Media for Topic Detection by Dense Keyword Groups

Visual event recognition in videos by learning from Web data.

Exploiting Web Images for Event Recognition in Consumer Videos: A Multiple Source Domain Adaptation Approach

Exploiting Web Images for Semantic Video Indexing Via Robust Sample-Specific Loss

A generic framework for event detection in various video domains.

Automatic Visual Concept Learning for Social Event Understanding

A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset

Using Bag of Visual Words for Video Retrieval Calibration

Multimodal Information Joint Learning for Geotagged Image Search.

They Are Not Equally Reliable: Semantic Event Search Using Differentiated Concept Classifiers

Event-Driven Semantic Concept Discovery by Exploiting Weakly Tagged Internet Images

Scalable Video Event Retrieval by Visual State Binary Embedding.