Abstract: Precisely localizing the temporal interval of each action segment in long raw videos is an essential challenge in practical video content analysis (e.g., activity detection or video caption generation). Most previous works neglect the hierarchical granularity of actions and consequently fail to identify precise action boundaries (e.g., embracing approaching or turning a screw in mechanical maintenance). In this paper, we introduce a simple yet efficient coarse-to-fine network (CFNet) that addresses temporal action localization by progressively refining action boundaries at multiple action granularities. The proposed CFNet is composed of three main components: a coarse proposal module (CPM) that generates coarse action candidates, a fusion block (FB) that enhances the feature representation by fusing the coarse candidate features with the corresponding features of the raw input frames, and a boundary transformer module (BTM) that further refines action boundaries. Specifically, CPM exploits framewise, matching, and gated actionness curves that complement each other to generate coarse candidates at different levels, while FB enriches the feature representation by fusing the last feature map of CPM with the corresponding raw frame input. Finally, BTM learns long-term temporal dependencies with a transformer structure to refine action boundaries at a finer granularity. In this way, fine-grained action intervals are obtained incrementally. Compared with previous state-of-the-art techniques, the proposed coarse-to-fine network can asymptotically approach fine-grained action boundaries. Comprehensive experiments on the publicly available THUMOS14 and ActivityNet-v1.3 datasets show that our method improves on prior methods across various video action parsing tasks.
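To make the notion of coarse candidates from an actionness curve concrete, the following is a minimal sketch and not CFNet's actual CPM: it assumes a single framewise actionness score per frame and a fixed threshold, groups consecutive high-scoring frames into intervals, and scores an interval against a reference with temporal IoU, the overlap measure commonly used when evaluating temporal action localization. The function names, the threshold value, and the sample scores are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # [start, end) in frame indices


def coarse_proposals(actionness: List[float], threshold: float = 0.5) -> List[Interval]:
    """Group consecutive frames with score >= threshold into intervals.

    Hypothetical stand-in for a coarse proposal step; the threshold is an
    assumption, not a value taken from the paper.
    """
    proposals: List[Interval] = []
    start = None
    for t, score in enumerate(actionness):
        if score >= threshold and start is None:
            start = t  # a candidate segment begins here
        elif score < threshold and start is not None:
            proposals.append((start, t))  # segment ended at the previous frame
            start = None
    if start is not None:
        # The curve was still above threshold at the last frame.
        proposals.append((start, len(actionness)))
    return proposals


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two [start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Made-up actionness scores for a 10-frame clip.
    scores = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.1, 0.6, 0.7, 0.2]
    candidates = coarse_proposals(scores)
    print(candidates)
    # Compare the first candidate with a made-up reference interval.
    print(temporal_iou(candidates[0], (3, 6)))
```

A refinement stage such as the BTM described above would then adjust each candidate's start and end; that step is learned in the paper and is not reproduced here.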

Atrous Temporal Convolutional Network For Video Action Segmentation

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Stacking-Based Attention Temporal Convolutional Network for Action Segmentation

A motion-aware and temporal-enhanced Spatial–Temporal Graph Convolutional Network for skeleton-based human action segmentation

Attention-based Temporal Weighted Convolutional Neural Network for Action Recognition

Action Recognition by an Attention-Aware Temporal Weighted Convolutional Neural Network

A Temporal Convolutional Network for Weakly Supervised Action Segmentation

Pyramid Dilated Attention Network for Action Segmentation

Attentional Fused Temporal Transformation Network for Video Action Recognition

Involving Distinguished Temporal Graph Convolutional Networks for Skeleton-Based Temporal Action Segmentation

Temporal Segment Networks for Action Recognition in Videos

Temporal Segment Transformer for Action Segmentation

Adaptive Temporal Segmentation for Action Recognition

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

SCALE MATTERS: TEMPORAL SCALE AGGREGATION NETWORK FOR PRECISE ACTION LOCALIZATION IN UNTRIMMED VIDEOS

Adaptive Receptive Field U-shaped Temporal Convolutional Network for Vulgar Action Segmentation

Temporal Action Localization with Coarse-to-Fine Network

Spatial-temporal Pyramid Based Convolutional Neural Network for Action Recognition

3D Convolutional Two-Stream Network for Action Recognition in Videos

C2F-TCN: A Framework for Semi and Fully Supervised Temporal Action Segmentation

TSRN: Two-Stage Refinement Network for Temporal Action Segmentation