Abstract: Techniques for recognizing high-level events in consumer videos on the Internet have many applications. Systems that produce state-of-the-art recognition performance usually contain modules requiring extensive computation, such as the extraction of temporal motion trajectories, and therefore cannot be deployed on large-scale datasets. In this paper, we provide a comprehensive study of efficient methods in this area and identify technical options for super fast event recognition in Internet videos. We start by analyzing a multimodal baseline that has produced good performance on popular benchmarks, systematically evaluating each component in terms of both computational cost and contribution to recognition accuracy. We then identify alternative features, classifiers, and fusion strategies that can all be computed efficiently. We also study the following question: for event recognition in Internet videos, what is the minimum number of visual and audio frames needed to obtain accuracy comparable to that of using all the frames? Results on two rigorously designed datasets indicate that similar accuracy can be maintained using only a small portion of the visual frames. We also find that, unlike the visual frames, the soundtracks contain little redundant information, and thus sampling them is always harmful. Integrating all the findings, our suggested recognition system is 2,350-fold faster than a baseline approach while achieving even higher recognition accuracy. It recognizes 20 classes on a 120-second video sequence in just 1.78 seconds, using a regular desktop computer.
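To make the frame-sampling question concrete, the following is a minimal sketch of one way to keep only a small portion of the visual frames. It assumes uniform temporal sampling, which the abstract does not specify; the function name `uniform_sample_indices`, the 25 fps frame rate, and the choice of 12 frames are all hypothetical and are not taken from the paper.

```python
from typing import List


def uniform_sample_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly over [0, num_frames).

    Illustrative only: uniform sampling is an assumption, not necessarily
    the sampling strategy evaluated in the paper.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Nothing to drop: keep every frame.
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the centre of each of the num_samples equal-length segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


# Hypothetical numbers: a 120-second clip decoded at 25 fps, keeping 12 frames.
indices = uniform_sample_indices(120 * 25, 12)
```

Only the frames at `indices` would then be passed to feature extraction. Per the abstract, the same shortcut should not be applied to the soundtrack, where sampling was found to hurt accuracy.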

Instantly Telling What Happens in a Video Sequence Using Simple Features

A Method of Simultaneously Action Recognition and Video Segmentation of Video Streams.

Learning Discriminative Features for Fast Frame-Based Action Recognition.

Learning and Distillating the Internal Relationship of Motion Features in Action Recognition.

Dynamic Inference: A New Approach Toward Efficient Video Action Recognition

Automatic And Robust Classification Of Independent Motions In Video Sequences

Fast and Reliable Human Action Recognition in Video Sequences by Sequential Analysis

Super Fast Event Recognition in Internet Videos

Action Recognition Based on Discrete Cosine Transform by Optical Pixel-Wise Encoding

Segmenting Visual Actions Based on Spatio-Temporal Motion Patterns

A Fast Video Event Recognition System and Its Application to Video Search

Learning realistic human actions from movies.

Discriminative Optical Flow Tensor for Video Semantic Analysis

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Automatic Detection and Analysis of Player Action in Moving Background Sports Video Sequences

Effective Action Recognition with Embedded Key Point Shifts

Analysis of Human Actions for Video Indexing

Fast Video Segment Identification from Large Video Collection

Condensing a Sequence to One Informative Frame for Video Recognition

Real Time Human Action Recognition in a Long Video Sequence

Automatic video scene segmentation based on spatial-temporal clues and rhythm