Abstract: Unsupervised representation learning for videos has recently achieved remarkable performance owing to the effectiveness of contrastive learning. Most work on video contrastive learning (VCL) pulls all snippets from the same video into the same category, even when some of them show different actions. This leads to temporal collapse, i.e., the snippet representations of a video do not vary as the video evolves over time. In this paper, we introduce a novel intra-video contrastive learning (intra-VCL) approach that alleviates this issue by further distinguishing the actions within a video. It maintains an asynchronous long-term memory bank, which caches the representations of all snippets of each video, and uses that bank to mine an extra positive/negative snippet within a video. Because the long-term memory is updated asynchronously, the cached representations become inconsistent when used for contrastive learning, so we further propose a consistent contrastive module (CCM) to perform consistent intra-VCL. Specifically, the CCM comprises an intra-video self-attention refinement function, which reduces the inconsistencies among the asynchronously updated representations (of all snippets of each video) in the long-term memory, and an adaptive loss re-weighting scheme, which reduces the unreliable self-supervision produced by inconsistent contrastive pairs. We call our method consistent intra-VCL. Extensive experiments demonstrate its effectiveness: it achieves state-of-the-art performance on the standard benchmarks of self-supervised action recognition, with top-1 accuracies of 64.2% and 91.0% on HMDB-51 and UCF-101, respectively.
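
The abstract names three components: a per-video memory bank of snippet representations, a self-attention refinement over that bank, and a re-weighted contrastive loss. The following NumPy sketch illustrates how such pieces could fit together for a single video. It is not the authors' implementation: the function names, the softmax-attention form of the refinement, the InfoNCE-style loss, the staleness-based weight, and all temperatures and sizes are assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def refine_bank(bank, tau=0.1):
    """Hypothetical refinement: each cached snippet feature becomes a
    softmax-attention mixture of all cached features of the same video.

    bank: (T, D) array, one row per snippet of a single video.
    """
    bank = l2_normalize(bank)
    logits = bank @ bank.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return l2_normalize(attn @ bank)


def staleness_weight(age, decay=10.0):
    """Hypothetical reliability weight: entries updated long ago count less."""
    return float(np.exp(-age / decay))


def intra_video_nce(query, bank, pos_idx, neg_idx, tau=0.1, weight=1.0):
    """InfoNCE-style loss with one positive and several negatives,
    all drawn from the same video's (refined) memory bank."""
    q = l2_normalize(query)
    pos_logit = q @ bank[pos_idx] / tau
    neg_logits = bank[neg_idx] @ q / tau
    logits = np.concatenate(([pos_logit], neg_logits))
    m = logits.max()
    log_denominator = m + np.log(np.exp(logits - m).sum())  # log-sum-exp
    return weight * (log_denominator - pos_logit)


# Toy usage: 8 cached snippets with 16-dimensional features (arbitrary sizes).
rng = np.random.default_rng(0)
bank = rng.normal(size=(8, 16))
refined = refine_bank(bank)

# A freshly encoded snippet that should match cached snippet 2 and be
# pushed away from snippets 6 and 7 of the same video.
query = refined[2] + 0.05 * rng.normal(size=16)
loss = intra_video_nce(
    query, refined, pos_idx=2, neg_idx=[6, 7], weight=staleness_weight(age=3)
)
```

In this sketch the weight simply scales the per-pair loss, so a pair judged unreliable contributes less gradient. How the actual method selects positives and negatives and computes its weights is not stated in the abstract.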

Consistent Intra-Video Contrastive Learning with Asynchronous Long-Term Memory Bank
