At First Sight: Zero-Shot Classification of Astronomical Images with Large Multimodal Models

Dimitrios Tanoglidis, Bhuvnesh Jain
2024-06-25
Abstract: Vision-Language multimodal Models (VLMs) offer the possibility for zero-shot classification in astronomy: i.e. classification via natural language prompts, with no training. We investigate two models, GPT-4o and LLaVA-NeXT, for zero-shot classification of low-surface brightness galaxies and artifacts, as well as morphological classification of galaxies. We show that with natural language prompts these models achieved significant accuracy (above 80 percent typically) without additional training/fine tuning. We discuss areas that require improvement, especially for LLaVA-NeXT, which is an open source model. Our findings aim to motivate the astronomical community to consider VLMs as a powerful tool for both research and pedagogy, with the prospect that future custom-built or fine-tuned models could perform better.
Instrumentation and Methods for Astrophysics, Astrophysics of Galaxies, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks whether large vision-language multimodal models (VLMs) can perform zero-shot classification in astronomy. Specifically, the authors explore how such models, guided only by natural-language prompts and with no additional training or fine-tuning, can separate low-surface-brightness galaxies (LSBGs) from artifacts and classify galaxy morphology. The paper focuses on two classification tasks:

1. **Classification of low-surface-brightness galaxies (LSBGs) and artifacts**:
   - The goal is to distinguish LSBGs from non-LSBG images; the latter may include eccentric bright galaxies, diffuse light, bright stars, and light reflections.
2. **Galaxy morphology classification**:
   - Galaxies are divided into four categories: smooth and circular galaxies, smooth and cigar-shaped galaxies, edge-on disk galaxies, and non-barred spiral galaxies.

The paper evaluates two multimodal models (GPT-4o and LLaVA-NeXT) on these tasks and discusses their strengths and the areas that need improvement. The authors hope these results will encourage the astronomy community to consider multimodal models as powerful tools for research and teaching.
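The prompt-based setup described above can be illustrated with a minimal Python sketch. This is not the authors' code or prompt wording: the prompt text and the helper names `build_prompt`, `parse_label`, and `accuracy` are hypothetical, and the call that sends the image and prompt to a VLM is deliberately left out. The sketch only shows how a zero-shot prompt over the four morphology classes might be assembled, how a free-text reply might be mapped back to a class, and how accuracy might be scored.

```python
import re

# Class names follow the summary above; the prompt wording is an assumption.
MORPHOLOGY_CLASSES = [
    "smooth and circular galaxy",
    "smooth and cigar-shaped galaxy",
    "edge-on disk galaxy",
    "non-barred spiral galaxy",
]


def build_prompt(classes):
    """Build a zero-shot prompt that asks for a single numbered label."""
    options = "\n".join(f"{i}. {name}" for i, name in enumerate(classes, start=1))
    return (
        "You are shown an image of a galaxy. "
        "Which of the following categories best describes it?\n"
        f"{options}\n"
        "Answer with the number of the category only."
    )


def parse_label(reply, n_classes):
    """Return the 0-based class index found in a reply, or None if unparseable."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    index = int(match.group()) - 1
    return index if 0 <= index < n_classes else None


def accuracy(predictions, truths):
    """Fraction of correct predictions; unparseable replies (None) count as wrong."""
    if not truths:
        raise ValueError("truths must not be empty")
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)


if __name__ == "__main__":
    print(build_prompt(MORPHOLOGY_CLASSES))
    # Hypothetical model replies, standing in for real VLM output.
    replies = ["1", "Category 3.", "I think 4", "unsure"]
    truths = [0, 2, 1, 3]
    predictions = [parse_label(r, len(MORPHOLOGY_CLASSES)) for r in replies]
    print(accuracy(predictions, truths))
```

Asking for a numbered answer keeps the parsing simple; a real pipeline would also need to handle replies that restate the class name instead of its number.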