Zero-shot animal behavior classification with vision-language foundation models

Gaspard Dussert, Vincent Miele, Colin Van Reeth, Anne Delestrade, Stephane Dray, Simon Chamaille-Jammes
DOI: https://doi.org/10.1101/2024.04.05.588078
2024-07-07
Abstract: 1. Understanding the behavior of animals in their natural habitats is critical to ecology and conservation. Camera traps are a powerful tool to collect such data with minimal disturbance. However, they produce a very large quantity of images, which can make human-based annotation cumbersome or even impossible. While automated species identification with artificial intelligence has made impressive progress, automatic classification of animal behaviors in camera trap images remains a developing field. 2. Here, we explore the potential of foundation models, specifically Vision Language Models (VLMs), to perform this task without the need to first train a model, which would require some level of human-based annotation. Using an original dataset of alpine fauna with behaviors annotated by participatory science, we investigate the zero-shot capabilities of different kinds of recent VLMs to predict behaviors and estimate behavior-specific diel activity patterns in three ungulate species. 3. Our results show that, using these methods, it is possible to achieve accuracies over 91% in behavior classification and produce activity patterns that closely align with those derived from participatory science data (overlap indexes between 84% and 90%). 4. These findings demonstrate the potential of foundation models and vision-language models in ecological research. Ecologists are encouraged to adopt these new methods and leverage their full capabilities to facilitate ecological studies.
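To make the "overlap index" between two diel activity patterns concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the estimator, so this example assumes a simple histogram-based overlap coefficient (the sum over time-of-day bins of the smaller of the two proportions); activity-pattern studies often use kernel density estimates instead. The function name `overlap_index`, the `n_bins` parameter, and the sample times are hypothetical and chosen only for illustration.

```python
def overlap_index(times_a, times_b, n_bins=24):
    """Histogram-based overlap coefficient between two sets of detection times.

    Times are hours of the day in [0, 24). The result lies in [0, 1]:
    1 means identical binned activity patterns, 0 means no shared bins.
    This is an illustrative simplification, not the paper's estimator.
    """
    if not times_a or not times_b:
        raise ValueError("both inputs must contain at least one time")

    def proportions(times):
        # Count detections per time-of-day bin, then normalize to proportions.
        counts = [0] * n_bins
        for t in times:
            counts[int((t % 24) / 24 * n_bins) % n_bins] += 1
        return [c / len(times) for c in counts]

    p_a = proportions(times_a)
    p_b = proportions(times_b)
    # Overlap coefficient: shared mass between the two binned distributions.
    return sum(min(a, b) for a, b in zip(p_a, p_b))


# Hypothetical detection times (hours) for one behavior, from two label sources.
model_times = [1.0, 1.5, 13.2, 13.8]
human_times = [1.2, 13.1, 13.4, 13.9]
print(overlap_index(model_times, human_times))
```

With real data, `times_a` could hold the times of images assigned to a behavior by a VLM and `times_b` the times of images assigned to the same behavior by human annotators; a value close to 1 would indicate closely matching activity patterns.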
Ecology
What problem does this paper attempt to address?