Abstract: In the real world, the appearance of the same object varies with factors such as resolution, angle, illumination, and viewing perspective. This suggests that a data augmentation pipeline that explores the overall appearance of the data could benefit downstream tasks in a self-supervised framework. Previous high-performing work on self-supervised learning relies heavily on data augmentation such as cropping and color distortion. However, most methods use a static data augmentation pipeline, which limits the amount of feature exploration. To generate representations that capture scale-invariant, explicit information about various semantic features and that are invariant to nuisance factors such as relative object location, brightness, and color distortion, we propose the Multi-View, Multi-Augmentation (MVMA) framework. MVMA consists of multiple augmentation pipelines, each comprising an assortment of augmentation policies. By refining the baseline self-supervised framework with modified loss objectives, MVMA explores a broader range of image appearances and image features through diverse data augmentation techniques. Transferring the representations learned with convolutional networks (ConvNets) to downstream tasks yields significant improvements over the state-of-the-art DINO across a wide range of vision and classification tasks: +4.1% and +8.8% top-1 accuracy on the ImageNet dataset with linear evaluation and a k-NN classifier, respectively. Moreover, MVMA achieves significant improvements of +5% and +7% on COCO object detection and segmentation, respectively.
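
The idea of several augmentation pipelines, each built from its own assortment of policies, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy grayscale image, the three policies, the crop sizes, and every function name (`random_crop`, `random_brightness`, `random_hflip`, `make_pipeline`, `generate_views`) are assumptions chosen for illustration, and the loss objectives mentioned in the abstract are not modeled here.

```python
import random

# Hypothetical policies: each factory returns a function (img, rng) -> img,
# where img is a 2D list of grayscale values in [0, 1].

def random_crop(size):
    """Policy that crops a random size x size window."""
    def policy(img, rng):
        top = rng.randint(0, len(img) - size)
        left = rng.randint(0, len(img[0]) - size)
        return [row[left:left + size] for row in img[top:top + size]]
    return policy

def random_brightness(max_delta):
    """Policy that shifts all pixels by one random offset, clamped to [0, 1]."""
    def policy(img, rng):
        delta = rng.uniform(-max_delta, max_delta)
        return [[min(1.0, max(0.0, p + delta)) for p in row] for row in img]
    return policy

def random_hflip(prob):
    """Policy that mirrors the image horizontally with probability prob."""
    def policy(img, rng):
        if rng.random() < prob:
            return [row[::-1] for row in img]
        return [row[:] for row in img]
    return policy

def make_pipeline(policies):
    """Compose an ordered list of policies into one augmentation pipeline."""
    def pipeline(img, rng):
        for policy in policies:
            img = policy(img, rng)
        return img
    return pipeline

def generate_views(img, pipelines, views_per_pipeline, seed=0):
    """Draw several views from every pipeline, grouped by pipeline order."""
    rng = random.Random(seed)
    return [pipe(img, rng) for pipe in pipelines for _ in range(views_per_pipeline)]

# Toy 8x8 image and two assumed pipelines with different crop scales.
image = [[(r * 8 + c) / 63 for c in range(8)] for r in range(8)]
pipelines = [
    make_pipeline([random_crop(6), random_hflip(0.5)]),
    make_pipeline([random_crop(3), random_brightness(0.3)]),
]
views = generate_views(image, pipelines, views_per_pipeline=2)
```

Here `views` holds two 6x6 views from the first pipeline followed by two 3x3 views from the second. In a self-supervised setup, all of these views of one image would be passed to the encoder and compared by the training objective.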

Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations

Learning Disentangled Representation for Multi-View 3D Object Recognition

Multi-view and multi-augmentation for self-supervised visual representation learning

MV2MAE: Multi-View Video Masked Autoencoders

MVSTER: Epipolar Transformer for Efficient Multi-View Stereo

Self-supervised Learning by View Synthesis

Learning Unsupervised Multi-View Stereopsis via Robust Photometric Consistency

MVContrast: Unsupervised Pretraining for Multi-view 3D Object Recognition

Multi-View Transformer for 3D Visual Grounding

View-to-Label: Multi-View Consistency for Self-Supervised 3D Object Detection

Modeling Long-Range Dependencies and Epipolar Geometry for Multi-View Stereo

Learning the Global Descriptor for 3-D Object Recognition Based on Multiple Views Decomposition

Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

Multi-View 3D Object Retrieval with Deep Embedding Network

Multiview Transformers for Video Recognition

AUTO3D: Novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation

Multiple View Geometry Transformers for 3D Human Pose Estimation

Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection

MetaViewer: Towards A Unified Multi-View Representation

A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding