Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery

Andy V. Huynh, Lauren E. Gillespie, Jael Lopez-Saucedo, Claire Tang, Rohan Sikand, Moisés Expósito-Alonso
2024-09-29
Abstract: Multimodal image-text contrastive learning has shown that joint representations can be learned across modalities. Here, we show how leveraging multiple views of image data with contrastive learning can improve downstream fine-grained classification performance for species recognition, even when one view is absent. We propose ContRastive Image-remote Sensing Pre-training (CRISP), a new pre-training task for ground-level and aerial image representation learning of the natural world, and introduce Nature Multi-View (NMV), a dataset of natural world imagery including more than 3 million ground-level and aerial image pairs for over 6,000 plant taxa across the ecologically diverse state of California. The NMV dataset and accompanying material are available at http://hf.co/datasets/andyvhuynh/NatureMultiView.
Computer Vision and Pattern Recognition