Multimodal Foundation Models for Zero-shot Animal Species Recognition in Camera Trap Images

Zalan Fabian, Zhongqi Miao, Chunyuan Li, Yuanhan Zhang, Ziwei Liu, Andrés Hernández, Andrés Montes-Rojas, Rafael Escucha, Laura Siabatto, Andrés Link, Pablo Arbeláez, Rahul Dodhia, Juan Lavista Ferres

2023-11-02

Abstract: Due to deteriorating environmental conditions and increasing human activity, conservation efforts directed towards wildlife are crucial. Motion-activated camera traps constitute an efficient tool for tracking and monitoring wildlife populations across the globe. Supervised learning techniques have been successfully deployed to analyze such imagery; however, training such models requires annotations from experts. Reducing the reliance on costly labelled data therefore has immense potential for developing large-scale wildlife tracking solutions with markedly less human labor. In this work, we propose WildMatch, a novel zero-shot species classification framework that leverages multimodal foundation models. In particular, we instruction-tune vision-language models to generate detailed visual descriptions of camera trap images using terminology similar to that of experts. Then, we match the generated caption to an external knowledge base of descriptions in order to determine the species in a zero-shot manner. We investigate techniques to build instruction tuning datasets for detailed animal description generation and propose a novel knowledge augmentation technique to enhance caption quality. We demonstrate the performance of WildMatch on a new camera trap dataset collected in the Magdalena Medio region of Colombia.

Computer Vision and Pattern Recognition,Machine Learning

What problem does this paper attempt to address?

The paper addresses species identification in camera trap images for wildlife monitoring. It proposes WildMatch, a framework that leverages multimodal foundation models for zero-shot animal species classification. The paper points out that traditional supervised learning methods need large amounts of expert-annotated data, which is labor-intensive to produce and must be collected again when a model is deployed in a new area. The study therefore aims to reduce the reliance on expensive annotated data so that large-scale wildlife tracking solutions can be built with far less human labor. WildMatch achieves zero-shot classification in two steps: an instruction-tuned vision-language model generates a detailed, expert-style description of the animal in a camera trap image, and that description is then matched against species descriptions in an external knowledge base. The study also examines how to build instruction tuning datasets for this description task, introduces a knowledge augmentation technique to improve caption quality, and evaluates the method on a new camera trap dataset collected in the Magdalena Medio region of Colombia.
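The matching step can be illustrated with a minimal sketch. This is not the authors' implementation: the summary above does not say how WildMatch compares a caption with the knowledge base, so the sketch substitutes a plain bag-of-words cosine similarity as a stand-in. The function names (`tokenize`, `cosine`, `match_species`), the species entries, and the caption text are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_species(caption, knowledge_base):
    """Return the best-scoring species and all similarity scores.

    Stand-in matcher (an assumption, not the paper's method): it scores
    each knowledge-base description against the caption by word overlap.
    """
    caption_vec = Counter(tokenize(caption))
    scores = {
        species: cosine(caption_vec, Counter(tokenize(description)))
        for species, description in knowledge_base.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical knowledge base: species name -> expert-style description.
KNOWLEDGE_BASE = {
    "ocelot": "medium sized cat with a tawny coat covered in dark spots "
              "and rosettes and a ringed tail",
    "capybara": "large stocky rodent with a blunt snout, short brown fur "
                "and no visible tail",
    "collared peccary": "pig like animal with coarse grey fur and a pale "
                        "collar around the neck",
}

# Hypothetical caption, standing in for the vision-language model's output.
caption = ("A medium sized cat with dark spots and rosettes on a tawny "
           "coat and a long ringed tail.")

best, scores = match_species(caption, KNOWLEDGE_BASE)
print(best)
```

Because no species label is ever trained on, supporting a new species in this sketch only requires adding one more description to the dictionary, which is the property that makes the approach zero-shot.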

Reviving the Context: Camera Trap Species Classification as Link Prediction on Multimodal Knowledge Graphs

CECS-CLIP: Fusing Domain Knowledge for Rare Wildlife Detection Model

Zero-shot animal behavior classification with vision-language foundation models

Animal Recognition and Identification with Deep Convolutional Neural Networks for Automated Wildlife Monitoring

Improving the accessibility and transferability of machine learning algorithms for identification of animals in camera trap images: MLWIC2

Class incremental learning for wildlife biodiversity monitoring in camera trap images

Metadata augmented deep neural networks for wild animal classification

Multiobject Tracking of Wildlife in Videos Using Few-Shot Learning

In-Situ Fine-Tuning of Wildlife Models in IoT-Enabled Camera Traps for Efficient Adaptation

Improved Re-Parameterized Convolution for Wildlife Detection in Neighboring Regions of Southwest China

Self-Supervised Pretraining and Controlled Augmentation Improve Rare Wildlife Recognition in UAV Images

Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning

Multi-modal Generative Adversarial Network for Zero-Shot Learning

Image-to-Image Translation of Synthetic Samples for Rare Classes

A deep active learning system for species identification and counting in camera trap images

Pytorch-Wildlife: A Collaborative Deep Learning Framework for Conservation

Automatic Recognition of Mammal Genera on Camera-Trap Images using Multi-Layer Robust Principal Component Analysis and Mixture Neural Networks

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Improved Wildlife Recognition through Fusing Camera Trap Images and Temporal Metadata