Abstract: Visual tempos reflect the dynamics of action instances and characterize the diversity of actions, such as walking slowly versus running quickly. Capturing visual tempos is therefore essential for action recognition. To this end, previous methods sample raw videos at multiple frame rates or integrate multi-scale temporal features. These methods inevitably introduce two-stream networks or feature-level pyramid structures, leading to expensive computation. In this work, we propose a progressive difference method that captures visual tempos for efficient action recognition by computing coarse-to-fine motion information within a small temporal neighborhood of each frame. Specifically, uniform sampling is first applied to each video, and first-order temporal differences around each frame are then calculated to describe local motion. Building on these differences, we further compute their variations, namely second-order differences, which gradually capture fine-grained spatiotemporal features and characterize the areas where motion cues are more prominent. On the one hand, multi-order motion differences can be combined with the raw input to describe the diversity of actions. On the other hand, the variations of the first-order differences can be used to activate salient first-order motion regions, thereby facilitating the discrimination of finer-grained actions. Our method can be combined with existing backbones in a plug-and-play manner. Extensive experiments are conducted on several video benchmarks, including Kinetics400, HMDB51, UCF101, UAV-Human, and Something-Something V1 and V2. We also provide detailed analysis and qualitative experiments to demonstrate the effectiveness of our method.
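
The sampling and differencing steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`uniform_sample_indices`, `progressive_differences`), the flattened-frame layout, the segment-centre sampling rule, and the use of the immediately preceding and following frames as the "small neighborhood" are all assumptions made for illustration. The abstract does not specify these details, and the sketch omits the activation of salient regions and the integration with a backbone.

```python
from typing import List, Sequence, Tuple

# Assumption of this sketch: a frame is a flat list of pixel intensities.
Frame = List[float]


def uniform_sample_indices(num_frames: int, num_samples: int) -> List[int]:
    """Return the centre index of each of num_samples equal-length segments."""
    segment = num_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


def subtract(a: Sequence[float], b: Sequence[float]) -> Frame:
    """Element-wise a - b for two frames of equal length."""
    return [x - y for x, y in zip(a, b)]


def progressive_differences(
    video: Sequence[Frame], num_samples: int
) -> List[Tuple[Frame, Frame, Frame, Frame]]:
    """For each sampled frame, return (frame, backward first-order difference,
    forward first-order difference, second-order difference).

    Hypothetical neighbourhood: one frame before and one frame after the
    sampled frame, clamped at the video boundaries.
    """
    last = len(video) - 1
    results = []
    for t in uniform_sample_indices(len(video), num_samples):
        prev_frame = video[max(t - 1, 0)]
        next_frame = video[min(t + 1, last)]
        d_back = subtract(video[t], prev_frame)   # first-order: motion into frame t
        d_fwd = subtract(next_frame, video[t])    # first-order: motion out of frame t
        second = subtract(d_fwd, d_back)          # second-order: change of the motion
        results.append((list(video[t]), d_back, d_fwd, second))
    return results
```

For a toy video whose pixel intensity grows quadratically with the frame index, the first-order differences grow linearly while the second-order difference is constant, which matches the intuition that second-order differences respond to changes in motion rather than to motion itself.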

Team SPEEDY Multi Moments in Time Challenge 2019 Technical Report

Top-1 Solution of Multi-Moments in Time Challenge 2019

Single-Camera and Inter-Camera Vehicle Tracking and 3D Speed Estimation Based on Fusion of Visual and Semantic Features

Technical Report for Ego4D Long Term Action Anticipation Challenge 2023

Technical Report: Competition Solution For Modelscope-Sora

FASTER Recurrent Networks for Efficient Video Classification

The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval

T-RECS: Training for Rate-Invariant Embeddings by Controlling Speed for Action Recognition

1st Place Solution to the 1st SkatingVerse Challenge

No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding

MSR Asia MSM at ActivityNet Challenge 2016

The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)

InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges

Faster Video Moment Retrieval with Point-Level Supervision

UTS-CMU at THUMOS 2015

SlowFast Networks for Video Recognition

NJU MCG - Sensetime Team Submission to Pre-training for Video Understanding Challenge Track II

A Progressive Difference Method for Capturing Visual Tempos on Action Recognition

MOTPose: Multi-object 6D Pose Estimation for Dynamic Video Sequences using Attention-based Temporal Fusion

Motion-Guided Spatial Time Attention for Video Object Segmentation

Towards Real-Time Open-Vocabulary Video Instance Segmentation