Abstract: Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in the domain of visual representation learning. They are not suitable for exploiting the rich dynamical structure of video however, as operations are done on many augmented instances. In this paper we propose "Video Cross-Stream Prototypical Contrasting", a novel method which predicts consistent prototype assignments from both RGB and optical flow views, operating on sets of samples. Specifically, we alternate the optimization process; while optimizing one of the streams, all views are mapped to one set of stream prototype vectors. Each of the assignments is predicted with all views except the one matching the prediction, pushing representations closer to their assigned prototypes. As a result, more efficient video embeddings with ingrained motion information are learned, without the explicit need for optical flow computation during inference. We obtain state-of-the-art results on nearest-neighbour video retrieval and action recognition, outperforming previous best by +3.2% on UCF101 using the S3D backbone (90.5% Top-1 acc), and by +7.2% on UCF101 and +15.1% on HMDB51 using the R(2+1)D backbone.
What problem does this paper attempt to address?
The paper addresses how to exploit the dynamic structure of video more effectively in video representation learning while reducing the need for a large number of negative samples. Instance-level contrastive learning has been very successful on image tasks, but it is a poor fit for video, whose rich dynamical structure it does not exploit. The paper proposes a new self-supervised method, "Video Cross-Stream Prototypical Contrasting", which learns video embeddings more efficiently by predicting consistent prototype assignments from RGB and optical flow views, and which does not require optical flow to be computed explicitly at inference time.
### Main Contributions
1. **Introduced a new visual-only self-supervised learning framework**: The framework contrasts sets of views drawn from two streams (RGB and optical flow). Compared with instance-level contrastive learning, it avoids unnecessary comparisons and computation while also improving accuracy.
2. **Proposed a new video training mechanism**: The RGB and optical flow streams are coupled in two ways: prototype assignments are predicted from views of both streams, and the optimization alternates between the streams. Because motion information is transferred into the RGB model, the optical flow network can be used or dropped at deployment depending on speed and efficiency requirements.
3. **Conducted extensive ablation studies**: The paper analyzes the method in depth and reports state-of-the-art performance with two backbone networks, S3D and R(2+1)D, on the UCF101 and HMDB51 datasets.
### Method Overview
- **Preliminary Concepts**: The paper first introduces the basic components of instance-level contrastive learning: the data augmentation module, the embedding function, and the contrastive loss function.
- **Predicting Stream Prototype Assignments**: Each stream uses a set of prototypes to avoid instance-level contrast. The data augmentation module is extended so that RGB frames and optical flow are both treated as views. Features are matched to prototypes to obtain soft assignments, which are computed with an optimal transport algorithm.
- **Cross-Stream Learning**: While one stream is being optimized, views from the other stream are also used for prediction. Knowledge is transferred from motion (optical flow) to appearance (RGB) through an alternating training process.
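To make the soft-assignment step concrete, here is a minimal NumPy sketch. It assumes the optimal transport problem is solved with a few Sinkhorn-Knopp iterations, as is common in prototype-based methods; the summary does not state which solver the paper uses, and the function name `sinkhorn` and the parameters `eps` and `n_iters` are illustrative choices, not taken from the paper.

```python
import numpy as np

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Turn a (B, K) feature-to-prototype similarity matrix into soft assignments.

    Assumption: Sinkhorn-Knopp normalization with entropy weight `eps`.
    Returns a (B, K) matrix whose rows each sum to 1.
    """
    # Subtract the max before exponentiating for numerical stability.
    Q = np.exp((scores - scores.max()) / eps).T  # shape (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        # Each prototype (row) receives total mass 1/K.
        Q /= Q.sum(axis=1, keepdims=True)
        Q /= K
        # Each sample (column) receives total mass 1/B.
        Q /= Q.sum(axis=0, keepdims=True)
        Q /= B
    # Rescale so every sample's assignment is a probability vector.
    return (Q * B).T

# Hypothetical data: 8 L2-normalized features, 4 L2-normalized prototypes.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
c = rng.normal(size=(4, 16))
c /= np.linalg.norm(c, axis=1, keepdims=True)
q = sinkhorn(z @ c.T)  # soft assignments, one row per feature
```

The row and column normalizations push the batch toward using all prototypes roughly equally, which is one way to keep every feature from collapsing onto a single prototype.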
### Experimental Results
- **Model Ablation**: Experiments verify that the cross-stream stage improves on the single-stream stage, in particular on the nearest-neighbor video retrieval and action recognition tasks.
- **Comparison with State-of-the-Art Methods**: The method obtains state-of-the-art results on nearest-neighbor video retrieval and action recognition, outperforming the previous best by +3.2% on UCF101 with the S3D backbone (90.5% Top-1 accuracy), and by +7.2% on UCF101 and +15.1% on HMDB51 with the R(2+1)D backbone.
### Formula Presentation
- **Contrastive Loss Function**:
\[
L_{\text{InfoNCE}}(z_i, z_j)=-\log\frac{\exp(z_i\cdot z_j / \tau)}{\sum_{k\neq i}\exp(z_i\cdot z_k / \tau)}
\]
where \(\tau\) is the temperature hyperparameter, and \(z_i\cdot z_j\) represents the dot product between normalized vectors, that is, the cosine similarity.
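As a rough illustration of the InfoNCE formula, the following NumPy sketch evaluates it for one anchor `i` and its positive `j` within a batch of embeddings. The function name `info_nce`, its argument layout, and the default temperature are assumptions made for this example.

```python
import numpy as np

def info_nce(z, i, j, tau=0.1):
    """InfoNCE loss for anchor index i and positive index j.

    z is an (N, D) array of embeddings; rows are L2-normalized here so that
    dot products are cosine similarities.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sims = z @ z[i] / tau                 # similarity of every sample to the anchor
    mask = np.arange(len(z)) != i         # the denominator sums over k != i
    logits = sims[mask]
    m = logits.max()                      # log-sum-exp trick for stability
    log_denom = m + np.log(np.exp(logits - m).sum())
    return log_denom - sims[j]            # equals -log(exp(s_ij) / sum_k exp(s_ik))

# Hypothetical batch: sample 1 is the positive for sample 0, sample 2 is a negative.
batch = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
loss = info_nce(batch, i=0, j=1, tau=1.0)
```

Because the positive is one of the terms in the denominator, the loss is never negative, and it shrinks as the anchor moves closer to its positive than to the other samples.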
- **Single-Stream Prediction Loss**:
\[
L_{\text{Single-stream}}^s(z_i^s, z_j^s)=l_s(z_j^s, q_i^s)+l_s(z_i^s, q_j^s)
\]
where each term is the cross-entropy between the stream prototype assignment \(q\) of one view and the softmax probabilities computed from the similarities between the other view's features and the stream prototypes \(c_k^s\):
\[
l_s(z_j^s, q_i^s)=-\sum_k q_i^{s(k)}\log\frac{\exp(z_j^s\cdot c_k^s / \tau)}{\sum_{k'}\exp(z_j^s\cdot c_{k'}^s / \tau)}
\]
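The swapped prediction in the single-stream loss can be sketched in NumPy as below, treating \(l_s(z, q)\) as the cross-entropy between an assignment \(q\) and the softmax over the similarities of a feature \(z\) to the prototypes. The helper names `log_softmax`, `prototype_ce`, and `single_stream_loss` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax of a 1-D array."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def prototype_ce(z, q, prototypes, tau=0.1):
    """l_s(z, q): cross-entropy between assignment q (K,) and softmax(z . c_k / tau).

    prototypes is a (K, D) array of prototype vectors c_k; z is a (D,) feature.
    """
    return -np.sum(q * log_softmax(prototypes @ z / tau))

def single_stream_loss(z_i, z_j, q_i, q_j, prototypes, tau=0.1):
    """Swapped prediction: view j predicts the assignment of view i, and vice versa."""
    return (prototype_ce(z_j, q_i, prototypes, tau)
            + prototype_ce(z_i, q_j, prototypes, tau))

# Hypothetical example: 3 orthogonal prototypes, both views aligned with prototype 0.
protos = np.eye(3)
z_a = np.array([1.0, 0.0, 0.0])
z_b = np.array([1.0, 0.0, 0.0])
q_a = np.array([1.0, 0.0, 0.0])
q_b = np.array([1.0, 0.0, 0.0])
loss = single_stream_loss(z_a, z_b, q_a, q_b, protos, tau=1.0)
```

The loss is symmetric in the two views, so swapping `(z_a, q_a)` with `(z_b, q_b)` gives the same value.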
- **Cross-Stream Prediction Loss**:
\[
L_{\text{Cross-stream}}^s(z_i^s, z_j^s, z_i^t,