Abstract: Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in the domain of visual representation learning. They are not suitable for exploiting the rich dynamical structure of video however, as operations are done on many augmented instances. In this paper we propose "Video Cross-Stream Prototypical Contrasting", a novel method which predicts consistent prototype assignments from both RGB and optical flow views, operating on sets of samples. Specifically, we alternate the optimization process; while optimizing one of the streams, all views are mapped to one set of stream prototype vectors. Each of the assignments is predicted with all views except the one matching the prediction, pushing representations closer to their assigned prototypes. As a result, more efficient video embeddings with ingrained motion information are learned, without the explicit need for optical flow computation during inference. We obtain state-of-the-art results on nearest-neighbour video retrieval and action recognition, outperforming previous best by +3.2% on UCF101 using the S3D backbone (90.5% Top-1 acc), and by +7.2% on UCF101 and +15.1% on HMDB51 using the R(2+1)D backbone.

What problem does this paper attempt to address?

This paper asks how video representation learning can make more effective use of the dynamic structure of video while reducing the need for a large number of negative samples. Instance-level contrastive learning has achieved remarkable success on image tasks, but it is not fully applicable to video data because it does not exploit the rich dynamic structure of video. The paper proposes a new self-supervised learning method, "Video Cross-Stream Prototypical Contrasting", which aims to learn video embeddings more efficiently by predicting consistent prototype assignments from RGB and optical flow views, without needing to compute optical flow explicitly at inference time.

### Main Contributions

1. **Introduced a new visual-only self-supervised learning framework**: The framework uses a set of views from two streams (RGB and optical flow) for contrastive learning. Compared with instance-level contrastive learning, it avoids unnecessary comparisons and calculations while also improving accuracy.
2. **Proposed a new video training mechanism**: The RGB and optical flow streams are connected in two ways: prototypes are predicted from both streams, and the optimization alternates between them. Because motion information is passed to the RGB model, the optical flow network can be used or omitted at deployment depending on speed and efficiency requirements.
3. **Conducted extensive ablation studies**: The paper provides an in-depth analysis of the method, and the results show state-of-the-art performance with two backbone networks, S3D and R(2+1)D, on the UCF101 and HMDB51 datasets.
### Method Overview

- **Preliminary Concepts**: The paper first introduces the basic concepts of instance-level contrastive learning, including the data augmentation module, the embedding function, and the contrastive loss function.
- **Predicting Stream Prototype Assignments**: Each stream uses a set of prototypes to avoid instance-level contrast. The data augmentation module is extended so that RGB frames and optical flow are both treated as views. Features are matched to prototypes to obtain soft assignments, which are computed with an optimal transport algorithm.
- **Cross-Stream Learning**: While one stream is being optimized, information from the other stream is used for prediction. Knowledge is transferred from motion (optical flow) to appearance (RGB) through an alternating training process.

### Experimental Results

- **Model Ablation**: Experiments verify that the cross-stream stage improves on the single-stream stage, particularly in nearest-neighbor video retrieval and action recognition.
- **Comparison with State-of-the-Art Methods**: On the UCF101 and HMDB51 datasets, the method outperforms existing methods in both nearest-neighbor video retrieval and action recognition. The reported gains over the previous best are +3.2% on UCF101 with the S3D backbone, and +7.2% on UCF101 and +15.1% on HMDB51 with the R(2+1)D backbone.

### Formula Presentation

- **Contrastive Loss Function**:
  \[ L_{\text{InfoNCE}}(z_i, z_j) = -\log\frac{\exp(z_i\cdot z_j / \tau)}{\sum_{k\neq i}\exp(z_i\cdot z_k / \tau)} \]
  where \(\tau\) is the temperature hyperparameter and \(z_i\cdot z_j\) is the dot product between normalized vectors, that is, the cosine similarity.
- **Single-Stream Prediction Loss**:
  \[ L_{\text{Single-stream}}^s(z_i^s, z_j^s) = l_s(z_j^s, q_i^s) + l_s(z_i^s, q_j^s) \]
  where each term is the cross-entropy loss between the stream prototype assignment \(q\) and the probability obtained by a softmax over the similarities between an embedding and the stream prototypes \(c_k^s\):
  \[ l_s(z_j^s, q_i^s) = -\sum_k q_i^{(k)}\log\frac{\exp(z_j^s\cdot c_k^s / \tau)}{\sum_{k'}\exp(z_j^s\cdot c_{k'}^s / \tau)} \]
- **Cross-Stream Prediction Loss**: \[ L_{\text{Cross-stream}}^s(z_i^s, z_j^s, z_i^t,
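
The two fully stated losses in this list, the InfoNCE loss and the single-stream prediction loss, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`info_nce`, `prediction_loss`, `single_stream_loss`), the toy vectors, and the temperature value are all hypothetical, and the assignments `q` are supplied by hand instead of being computed with optimal transport.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def logsumexp(scores):
    """Numerically stable log(sum(exp(s))) over a list of floats."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def info_nce(zs, i, j, tau=0.1):
    """InfoNCE loss for anchor i and positive j; the sum runs over all k != i."""
    pos = dot(zs[i], zs[j]) / tau
    others = [dot(zs[i], zs[k]) / tau for k in range(len(zs)) if k != i]
    return logsumexp(others) - pos

def prediction_loss(z, q, prototypes, tau=0.1):
    """Cross-entropy between assignment q and softmax(z . c_k / tau)."""
    scores = [dot(z, c) / tau for c in prototypes]
    lse = logsumexp(scores)
    return -sum(qk * (s - lse) for qk, s in zip(q, scores))

def single_stream_loss(z_i, z_j, q_i, q_j, prototypes, tau=0.1):
    """Swapped prediction: each view predicts the other view's assignment."""
    return (prediction_loss(z_j, q_i, prototypes, tau)
            + prediction_loss(z_i, q_j, prototypes, tau))

# Toy data (assumed values): unit-norm embeddings and two prototypes.
zs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
z_i, z_j = [0.8, 0.6], [0.6, 0.8]
q_i, q_j = [0.9, 0.1], [0.1, 0.9]  # hand-picked soft assignments

print("InfoNCE:", info_nce(zs, 0, 1))
print("Single-stream:", single_stream_loss(z_i, z_j, q_i, q_j, prototypes))
```

In this sketch the prediction loss is small when an embedding lies close to the prototype that its paired view was assigned to, which matches the description of pushing representations toward their assigned prototypes.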

Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting

Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos

Contrastive Learning of Image Representations with Cross-Video Cycle-Consistency

Contrast and Order Representations for Video Self-supervised Learning.

Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework

Multi-view Self-Supervised Contrastive Learning for Multivariate Time Series

Time-Equivariant Contrastive Video Representation Learning

Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization

Motion Sensitive Contrastive Learning for Self-supervised Video Representation

Self-Supervised Video Representation Learning with Motion-Contrastive Perception

Cross-view motion consistent self-supervised video inter-intra contrastive for action representation understanding

An Improved Inter-intra Contrastive Learning Framework on Self-supervised Video Representation

Cycle-Contrast for Self-Supervised Video Representation Learning

Controllable Augmentations for Video Representation Learning.

Video Self-Supervised Cross-Pathway Training Based on Slow and Fast Pathways

Self-supervised pretext task collaborative multi-view contrastive learning for video action recognition

Compressed Video Contrastive Learning.

Attentive spatial-temporal contrastive learning for self-supervised video representation

Contrastive Predictive Coding with Transformer for Video Representation Learning

Probabilistic Representations for Video Contrastive Learning

Video Representation Learning with Graph Contrastive Augmentation