Abstract: Video action segmentation aims to assign every video frame to one of a set of pre-defined human action categories. This work proposes a novel model, dubbed the diffused Fourier network (DFN), for video action segmentation. It advances the field by addressing several central bottlenecks in existing methods. First, capturing long-range dependence among video frames is known to be crucial for precisely estimating the temporal boundaries of actions. Rather than relying on compute-intensive self-attention modules or stacking multi-rate dilated convolutions as in previous models (e.g., ASFormer), we devise a Fourier token mixer over shiftable temporal windows in the video sequence, which harnesses the parameter-free and lightweight Fast Fourier Transform (FFT) for efficient spectral-temporal feature learning. Essentially, even simple spectral operations (e.g., pointwise product) bring a global receptive field across the entire temporal window. The proposed Fourier token mixer thus provides a low-cost alternative to existing practice. Second, the results of action segmentation tend to be fragmented, primarily due to noisy per-frame action likelihoods, a problem known as over-segmentation in the literature. Inspired by recently proposed diffusion models, we treat over-segments as noise corrupting the true temporal boundaries, and conduct denoising via recurrent execution of a parameter-sharing module, akin to the backward denoising process in diffusion models. Comprehensive experiments on three video benchmarks (GTEA, 50Salads and Breakfast) validate that the proposed method strikes an excellent balance between computation / parameter count and accuracy.
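The claim that a pointwise product in the frequency domain gives every frame a receptive field spanning its whole temporal window can be illustrated with a minimal sketch. This is not the DFN implementation: the function names (`fft`, `ifft`, `fourier_mix_window`, `fourier_mix`), the use of one scalar feature per frame, the fixed hand-picked filter `low_pass`, and the cyclic shift used to move the window boundaries are all assumptions made for illustration only.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def ifft(spectrum):
    """Inverse FFT: conjugate-sign transform followed by division by the length."""
    n = len(spectrum)
    return [v / n for v in fft(spectrum, inverse=True)]


def fourier_mix_window(frames, spectral_filter):
    """Mix one window: FFT, pointwise product with the filter, inverse FFT."""
    spectrum = fft([complex(v) for v in frames])
    mixed = [s * w for s, w in zip(spectrum, spectral_filter)]
    # The filter is assumed real and symmetric, so the result is real.
    return [v.real for v in ifft(mixed)]


def fourier_mix(sequence, spectral_filter, window, shift=0):
    """Apply the mixer over non-overlapping windows, cyclically shifted by `shift`."""
    sequence = list(sequence)
    n = len(sequence)
    if n % window != 0 or len(spectral_filter) != window:
        raise ValueError("sequence length and filter must match the window size")
    if window & (window - 1) != 0:
        raise ValueError("window must be a power of two for this radix-2 FFT")
    rolled = sequence[shift:] + sequence[:shift]  # move the window boundaries
    out = []
    for start in range(0, n, window):
        out.extend(fourier_mix_window(rolled[start:start + window], spectral_filter))
    return out[n - shift:] + out[:n - shift]  # undo the cyclic shift


# Hypothetical input: 16 frames, a single active frame at position 0.
impulse = [1.0] + [0.0] * 15
# Hypothetical real, symmetric low-pass filter over 8 frequency bins.
low_pass = [1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.25, 0.5]

mixed = fourier_mix(impulse, low_pass, window=8)
mixed_shifted = fourier_mix(impulse, low_pass, window=8, shift=4)
```

In `mixed`, the single active frame influences all eight frames of its own window and none of the frames in the other window. In `mixed_shifted`, the window boundaries move by four frames, so the same frame now influences a different group of neighbours. Alternating shifted and unshifted windows is one plausible way information could propagate across window boundaries; whether DFN does exactly this is not stated in the abstract.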

Frequency Enhancement Network for Efficient Compressed Video Action Recognition

F2D-SIFPNet: a Frequency 2D Slow-I-Fast-P Network for Faster Compressed Video Action Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

TEINet: Towards an Efficient Architecture for Video Recognition

Cross-Fiber Spatial-Temporal Co-enhanced Networks for Video Action Recognition

Dynamic Spatial Focus for Efficient Compressed Video Action Recognition

Joint Feature Optimization and Fusion for Compressed Action Recognition

FEASE: Feature Selection and Enhancement Networks for Action Recognition

Efficient spatio-temporal network for action recognition

GCF-Net: Gated Clip Fusion Network for Video Action Recognition

DC3D: A Video Action Recognition Network Based on Dense Connection

Diffused Fourier Network for Video Action Segmentation

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

FEXNet: Foreground Extraction Network for Human Action Recognition

Discriminative Segment Focus Network for Fine-grained Video Action Recognition

An Improved Action Recognition Network With Temporal Extraction and Feature Enhancement

Multi-head attention-based two-stream EfficientNet for action recognition

ACTION-Net: Multipath Excitation for Action Recognition

Multi-Stream Single Network: Efficient Compressed Video Action Recognition With a Single Multi-Input Multi-Output Network

Fine-Tuned Temporal Dense Sampling with 1D Convolutional Neural Network for Human Action Recognition

Fragrant: frequency-auxiliary guided relational attention network for low-light action recognition