Abstract: As Artificial Intelligence (AI) has developed rapidly over the past few decades, the new generation of AI, Large Language Models (LLMs) trained on massive datasets, has achieved ground-breaking performance in many applications. Further progress has been made in multimodal LLMs, with many datasets created to evaluate LLMs with vision abilities. However, none of those datasets focuses solely on marine mammals, which are indispensable for ecological equilibrium. In this work, we build a benchmark dataset with 1,423 images of 65 kinds of marine mammals, where each animal is uniquely classified into different levels of class, ranging from species-level to medium-level to group-level. Moreover, we evaluate several approaches for classifying these marine mammals: (1) machine learning (ML) algorithms using embeddings provided by neural networks, (2) influential pre-trained neural networks, (3) zero-shot models: CLIP and LLMs, and (4) a novel LLM-based multi-agent system (MAS). The results demonstrate the strengths of traditional models and LLMs in different aspects, and the MAS can further improve the classification performance. The dataset is available on GitHub: https://github.com/yeyimilk/LLM-Vision-Marine-Animals.git
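The three label levels mentioned in the abstract (species-level, medium-level, group-level) can be illustrated with a small sketch. The taxonomy below is a hypothetical five-species example chosen for illustration, not the dataset's actual label set, and the names `TAXONOMY`, `label_at`, and `accuracy_by_level` are assumptions. It shows how a single species-level prediction could be scored at each coarser level.

```python
# Hypothetical taxonomy for illustration only; the real dataset defines its own labels.
# Each species maps to (medium-level label, group-level label).
TAXONOMY = {
    "bottlenose dolphin": ("dolphin", "cetacean"),
    "orca": ("dolphin", "cetacean"),
    "humpback whale": ("baleen whale", "cetacean"),
    "harbor seal": ("true seal", "pinniped"),
    "california sea lion": ("eared seal", "pinniped"),
}

LEVELS = ("species", "medium", "group")


def label_at(species, level):
    """Return the label of `species` at the requested hierarchy level."""
    if level == "species":
        return species
    medium, group = TAXONOMY[species]
    return medium if level == "medium" else group


def accuracy_by_level(true_species, predicted_species):
    """Score species-level predictions at every level of the hierarchy."""
    if len(true_species) != len(predicted_species):
        raise ValueError("label lists must have the same length")
    n = len(true_species)
    return {
        level: sum(
            label_at(t, level) == label_at(p, level)
            for t, p in zip(true_species, predicted_species)
        ) / n
        for level in LEVELS
    }


truth = ["orca", "harbor seal", "humpback whale", "bottlenose dolphin"]
preds = ["bottlenose dolphin", "california sea lion", "humpback whale", "bottlenose dolphin"]
print(accuracy_by_level(truth, preds))
```

In this toy case a model that confuses an orca with a bottlenose dolphin is wrong at the species level but right at the medium and group levels, which is why accuracy can only stay the same or rise as the labels get coarser.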

### What problems does this paper attempt to solve?

This paper addresses two main problems:

1. **Lack of an image classification dataset specifically for marine mammals**:
   - Many datasets exist for evaluating large language models (LLMs) and multimodal models, but most cover broad categories such as general objects or specific animal groups, and none focuses on marine mammals. This makes it difficult for researchers to analyze marine mammals in detail and to develop targeted conservation strategies.
   - To fill this gap, the authors constructed a new dataset of 1,423 images covering 65 kinds of marine mammals, with each animal classified at several hierarchical levels (species-level, medium-level, and group-level).
2. **Evaluating the performance of existing models on marine mammal image classification**:
   - The authors evaluated several methods for classifying these images:
     - Traditional machine learning (ML) algorithms using embeddings provided by neural networks.
     - Influential pre-trained neural networks.
     - Zero-shot models: CLIP and LLMs.
     - A novel LLM-based multi-agent system (MAS).
   - These evaluations aim to reveal the strengths and limitations of each approach on this task, and in particular to explore the potential of LLMs in this field.

### Main contributions

- **New dataset**: An image dataset dedicated to marine mammals, filling a gap in existing benchmarks.
- **Comprehensive model evaluation**: A detailed evaluation of multiple models, showing how different methods perform on marine mammal classification.
- **Multi-agent system**: An LLM-based multi-agent system (MAS) that further improves classification performance.

### Conclusion

Together, these contributions address the scarcity of marine mammal image data and provide a foundation for future research on marine mammal image classification.
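As a rough illustration of the zero-shot setting listed among the evaluated approaches, the sketch below scores one image embedding against text-prompt embeddings by cosine similarity and turns the scores into probabilities with a softmax, which is the general idea behind CLIP-style zero-shot classification. It is not the authors' pipeline: the three-dimensional vectors are toy stand-ins for real encoder outputs, the prompt wording is assumed, and `temperature=100.0` is an assumed scaling constant.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, prompt_embeddings, temperature=100.0):
    """Pick the prompt whose embedding is most similar to the image embedding.

    `prompt_embeddings` maps a text prompt to its (toy) embedding vector.
    Returns the best prompt and a prompt -> probability dictionary.
    """
    prompts = list(prompt_embeddings)
    logits = [
        temperature * cosine(image_embedding, prompt_embeddings[p]) for p in prompts
    ]
    probs = softmax(logits)
    best = max(range(len(prompts)), key=probs.__getitem__)
    return prompts[best], dict(zip(prompts, probs))


# Toy embeddings standing in for image/text encoder outputs (assumed values).
image = [0.9, 0.1, 0.0]
prompts = {
    "a photo of a dolphin": [1.0, 0.0, 0.0],
    "a photo of a seal": [0.0, 1.0, 0.0],
    "a photo of a manatee": [0.0, 0.0, 1.0],
}
label, probabilities = zero_shot_classify(image, prompts)
print(label)
```

No marine mammal images are needed to fit anything here, which is the defining property of the zero-shot approach: the candidate classes are supplied only as text prompts at inference time.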

Benchmarking Large Language Models for Image Classification of Marine Mammals
