Abstract: Learning-based video annotation is essential for video analysis and understanding, and many approaches have been proposed to avoid the intensive labor costs of purely manual annotation. However, a generic framework is still lacking because of several difficulties: dependence on domain knowledge, insufficient training data, imprecise localization, and inefficiency on large-scale video datasets. In this paper, we propose a novel approach to annotating interesting events in videos, based on semi-supervised learning that draws on information from the Internet. Concretely, we propose a Fast Graph-based Semi-Supervised Multiple Instance Learning (FGSSMIL) algorithm, which aims to tackle these difficulties simultaneously in a generic framework for various video domains (e.g., sports, news, and movies) by jointly exploring small-scale expert-labeled videos and large-scale unlabeled videos to train the models. The expert-labeled videos are obtained by analyzing and aligning well-structured video-related text (e.g., movie scripts, web-casting text, closed captions). The unlabeled data are obtained by querying video search engines (e.g., YouTube, Google) for related events, in order to provide more distributional information for event modeling. FGSSMIL raises two critical issues. The first is how to assign weights during graph construction, where the weight of an edge specifies the similarity between two data points. To tackle this problem, we propose a novel Multiple Instance Learning Induced Similarity (MILIS) measure that learns instance-sensitive classifiers. The second is how to solve the resulting optimization problem efficiently for large-scale datasets. To address this issue, we adopt the Concave-Convex Procedure (CCCP) and a nonnegative multiplicative updating rule. We perform extensive experiments in three popular video domains: movies, sports, and news. The results compare favorably with state-of-the-art methods and demonstrate the effectiveness and efficiency of the proposed approach.
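To make the graph-based semi-supervised idea concrete, the following is a minimal sketch and not the FGSSMIL algorithm itself. It assumes a plain Gaussian kernel as a stand-in for the MILIS similarity, and a closed-form label-propagation solve as a stand-in for the CCCP and multiplicative-update optimization. The function names `gaussian_graph` and `propagate_labels`, the parameters `sigma` and `alpha`, and the toy data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_graph(X, sigma=1.0):
    """Dense similarity graph: W[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).

    Assumption: a Gaussian kernel replaces the paper's MILIS measure here.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def propagate_labels(W, y, alpha=0.9):
    """Spread labels over the graph.

    y holds +1 / -1 for labeled points and 0 for unlabeled points.
    Solves (I - alpha * S) f = y, where S is the symmetrically
    normalized weight matrix. Assumption: this closed-form solve
    replaces the paper's CCCP-based optimization.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * S, y.astype(float))

# Toy data: two well-separated clusters, one labeled point in each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([1, 0, 0, -1, 0, 0])

W = gaussian_graph(X, sigma=1.0)
scores = propagate_labels(W, y, alpha=0.9)
predicted = np.sign(scores)  # unlabeled points inherit their cluster's label
```

In this sketch the unlabeled points take the sign of the labeled point they are most strongly connected to, which is the role the large pool of unlabeled videos plays in the abstract: it shapes the graph along which the few expert labels spread.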
