Abstract: Videos naturally contain dynamic variation over the temporal axis, which causes the same visual cues (e.g., semantics, objects) to change in scale, position, and perspective between adjacent frames. A primary trend in video CNNs is to adopt spatial 2D convolution for spatial semantics and temporal 1D convolution for temporal dynamics. Though this direction achieves a favorable balance between efficiency and efficacy, it suffers from misalignment of visual cues with large displacements. In particular, rigid temporal convolution fails to capture correct motion when a specific target moves out of the receptive field of the temporal convolution between adjacent frames. To tackle large visual displacements between temporal neighbors, we propose a new temporal convolution named Hourglass Convolution (HgC). The temporal receptive field of HgC has an hourglass shape, in which the spatial receptive field is enlarged in the preceding and following frames, enabling it to capture large displacements. Moreover, since videos contain both long-term and short-term movements when viewed at multiple temporal interval levels, we organize the HgC network hierarchically to capture temporal dynamics at both the frame (short-term) and clip (long-term) levels. For efficiency, we also adopt strategies such as low resolution for short-term modeling and channel reduction for long-term modeling. With HgC, our H2CN equips off-the-shelf CNNs with a strong ability to capture spatio-temporal dynamics at negligible computation overhead. We validate the efficiency and efficacy of HgC on standard action recognition benchmarks, including Something-Something V1&V2, Diving48, and EGTEA Gaze+. We also analyse the complementarity of frame-level motion and clip-level motion with visualizations. The code and models will be available at https://github.com/ty-97/H2CN.
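
The receptive-field argument in the abstract can be illustrated with a toy example. The sketch below is not the authors' HgC implementation: it assumes a single-channel video stored as nested lists, fixed unit weights instead of learned ones, and a plain box sum over the neighboring frames. The names `box_sum`, `temporal_conv`, and `moving_dot` are hypothetical helpers introduced only for this illustration. Setting `k_neighbor=1` mimics a rigid temporal 1D convolution, while a larger `k_neighbor` widens the spatial window in the preceding and following frames, giving the hourglass shape.

```python
def box_sum(frame, y, x, k):
    """Sum of a k x k window centred on (y, x), with zero padding."""
    r = k // 2
    h, w = len(frame), len(frame[0])
    total = 0.0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w:
                total += frame[yy][xx]
    return total


def temporal_conv(video, k_neighbor, weights=(1.0, 1.0, 1.0)):
    """Toy 3-frame temporal filter (assumed form, not the paper's layer).

    The current frame contributes a single pixel; the previous and next
    frames each contribute a k_neighbor x k_neighbor window.
    """
    w_prev, w_cur, w_next = weights
    t_len, h, w = len(video), len(video[0]), len(video[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(t_len)]
    for t in range(t_len):
        for y in range(h):
            for x in range(w):
                acc = w_cur * video[t][y][x]
                if t > 0:  # zero padding at the first frame
                    acc += w_prev * box_sum(video[t - 1], y, x, k_neighbor)
                if t + 1 < t_len:  # zero padding at the last frame
                    acc += w_next * box_sum(video[t + 1], y, x, k_neighbor)
                out[t][y][x] = acc
    return out


def moving_dot(t_len, size, step):
    """A single bright pixel moving `step` columns per frame on the middle row."""
    video = [[[0.0] * size for _ in range(size)] for _ in range(t_len)]
    for t in range(t_len):
        video[t][size // 2][t * step] = 1.0
    return video


# The dot sits at x = 0, 2, 4 in frames 0, 1, 2 (row 3 of a 7 x 7 frame).
video = moving_dot(t_len=3, size=7, step=2)

rigid = temporal_conv(video, k_neighbor=1)      # rigid temporal 1D convolution
hourglass = temporal_conv(video, k_neighbor=5)  # enlarged neighbor windows

# Response at the dot's location in the middle frame.
rigid_response = rigid[1][3][2]
hourglass_response = hourglass[1][3][2]
print(rigid_response, hourglass_response)
```

In this toy setting the rigid filter sees the dot only in the current frame, because the dot has moved two pixels away in both neighbors, whereas the wider neighbor windows still cover it. The window size of 5 is an arbitrary choice for the example; the abstract does not state the kernel sizes used in HgC.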

Horizontal-to-Vertical Video Conversion

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

H2V4Sports: Real-Time Horizontal-to-Vertical Video Converter for Sports Lives Via Fast Object Detection and Tracking

Real-time Human-Centric Segmentation for Complex Video Scenes

HVConv: Horizontal and Vertical Convolution for Remote Sensing Object Detection

Hierarchical Hourglass Convolutional Network for Efficient Video Classification

Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks

Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation

HDR Video Reconstruction with a Large Dynamic Dataset in Raw and sRGB Domains

Towards Open-Vocabulary Video Instance Segmentation

Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach

A New Journey from SDRTV to HDRTV

HMFVC: A Human-Machine Friendly Video Compression Scheme

A Unified Framework for Human-centric Point Cloud Video Understanding

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

H2-Stereo: High-Speed, High-Resolution Stereoscopic Video System

A Spatial-Temporal Video Quality Assessment Method via Comprehensive HVS Simulation

HVS Revisited: A Comprehensive Video Quality Assessment Framework

Multiple Human Association between Top and Horizontal Views by Matching Subjects' Spatial Distributions

Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization