Abstract: This notebook paper presents an overview and comparative analysis of our systems designed for three tasks in ActivityNet Challenge 2017: trimmed action recognition, temporal action proposals, and dense-captioning events in videos.

Trimmed Action Recognition (TAR): We investigate and exploit multiple spatio-temporal cues for the TAR task, i.e., frame, short video clip, and motion (optical flow), by leveraging 2D or 3D convolutional neural networks (CNNs). The mechanism of different quantization methods is studied as well. Furthermore, improved dense trajectories with Fisher vector encoding over the whole trimmed video are utilized. All activities are finally classified by late fusion of the predictions of one-versus-rest linear SVMs learnt on each cue.

Temporal Action Proposals (TAP): To generate temporal action proposals from videos, we devise a three-stage workflow for the TAP task. Given an untrimmed video, our system first generates an actionness curve via a snippet-level actionness classifier. A temporal actionness grouping scheme is then applied over the actionness curve to produce proposal candidates. Finally, a proposal re-ranking procedure selects high-quality proposals via a proposal-level actionness classifier.

Dense-Captioning Events in Videos (DCEV): For the DCEV task, we first adopt the temporal action proposal system described above to localize temporal proposals of interest in a video, and then generate a description for each proposal. Specifically, RNNs encode a given video and its detected attributes into a fixed-dimensional vector, and then decode it into the target output sentence. Moreover, we extend the attributes-based CNNs plus RNNs model with policy gradient optimization and a retrieval mechanism to further boost video captioning performance.
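The grouping and re-ranking stages of the TAP workflow can be illustrated with a minimal sketch. The abstract does not specify the grouping rule or the proposal-level classifier, so everything here is an assumption made for illustration: the function names `group_actionness` and `rerank`, the fixed `threshold`, the choice to group contiguous above-threshold snippets, and the use of the mean snippet score as a stand-in for a learnt proposal-level actionness score.

```python
from typing import List, Tuple

Proposal = Tuple[int, int, float]  # (start, end_exclusive, score)


def group_actionness(scores: List[float], threshold: float = 0.5) -> List[Proposal]:
    """Group consecutive snippets whose actionness is >= threshold.

    Assumption: simple contiguous-run grouping; the actual scheme
    used by the system may differ (e.g. multiple thresholds).
    """
    proposals: List[Proposal] = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i  # a new candidate segment begins
        elif s < threshold and start is not None:
            seg = scores[start:i]
            proposals.append((start, i, sum(seg) / len(seg)))
            start = None
    if start is not None:  # segment runs to the end of the video
        seg = scores[start:]
        proposals.append((start, len(scores), sum(seg) / len(seg)))
    return proposals


def rerank(proposals: List[Proposal]) -> List[Proposal]:
    """Order proposals by score, highest first.

    Assumption: the mean snippet score stands in for the
    proposal-level actionness classifier named in the abstract.
    """
    return sorted(proposals, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    curve = [0.1, 0.8, 0.9, 0.2, 0.6, 0.7, 0.75, 0.1]  # made-up actionness curve
    for start, end, score in rerank(group_actionness(curve)):
        print(f"snippets [{start}, {end}) score={score:.3f}")
```

In this sketch the segment boundaries are snippet indices; converting them to timestamps would require the snippet stride, which the abstract does not state.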

Multipath 3D-Conv encoder and temporal-sequence decision for repetitive-action counting

Multi-branch Progressive Embedding Network for Crowd Counting

Repetitive Action Counting with Hybrid Temporal Relation Modeling

Context-Aware and Scale-Insensitive Temporal Repetition Counting

Rethinking temporal self-similarity for repetitive action counting

Temporal Distinct Representation Learning for Action Recognition

FCA-RAC: First Cycle Annotated Repetitive Action Counting

Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

MultiCounter: Multiple Action Agnostic Repetition Counting in Untrimmed Videos

Efficient Action Counting with Dynamic Queries

A Real-Time Action Representation With Temporal Encoding and Deep Compression

Short-Term Action Recognition by 3D Convolutional Neural Network with Pixel-Wise Evidences

Multipath Attention and Adaptive Gating Network for Video Action Recognition

Accelerating temporal action proposal generation via high performance computing

Enhanced 3D convolutional networks for crowd counting

3D-TDC: A 3D temporal dilation convolution framework for video action recognition

ACTION-Net: Multipath Excitation for Action Recognition

Energy-Based Periodicity Mining With Deep Features for Action Repetition Counting in Unconstrained Videos

MSR Asia MSM at ActivityNet Challenge 2017: Trimmed Action Recognition, Temporal Action Proposals and Dense-Captioning Events in Videos

Multi-scale Dynamic Network for Temporal Action Detection

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation