Abstract: We present PAPERCLIP (Proposal Abstracts Provide an Effective Representation for Contrastive Language-Image Pre-training), a method which associates astronomical observations imaged by telescopes with natural language using a neural network model. The model is fine-tuned from a pre-trained Contrastive Language-Image Pre-training (CLIP) model using successful observing proposal abstracts and corresponding downstream observations, with the abstracts optionally summarized via guided generation using large language models (LLMs). Using observations from the Hubble Space Telescope (HST) as an example, we show that the fine-tuned model embodies a meaningful joint representation between observations and natural language through tests targeting image retrieval (i.e., finding the most relevant observations using natural language queries) and description retrieval (i.e., querying for astrophysical object classes and use cases most relevant to a given observation). Our study demonstrates the potential for using generalist foundation models rather than task-specific models for interacting with astronomical data by leveraging text as an interface.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses how to associate astronomical observations (such as images taken by the Hubble Space Telescope) with natural-language descriptions. The authors propose a method named **PAPERCLIP**, which fine-tunes a pre-trained Contrastive Language-Image Pre-training (CLIP) model on the abstracts of successful observing proposals paired with the corresponding observations.

#### Main problems:

1. **Multi-modal data association**: How to establish an effective association between images and text in astronomy, so that relevant observations can be retrieved with natural-language queries.
2. **Adapting the model to astronomy**: How to fine-tune the pre-trained CLIP model so that it handles domain-specific astronomical data well, rather than only general image-text pairs.
3. **Evaluating model performance**: How to verify that the fine-tuned model performs well on image retrieval (finding the most relevant observations for a natural-language query) and description retrieval (finding the astrophysical object classes and scientific use cases most relevant to a given observation).

#### Solution overview:

- **Dataset construction**: Hubble Space Telescope observations are paired with the abstracts of the proposals that led to them. To make the text descriptions more consistent, the abstracts are optionally summarized using large language models (LLMs).
- **Model fine-tuning**: Starting from the pre-trained CLIP model, the dataset above is used for fine-tuning. Three training strategies are explored: fine-tuning the full model, freezing the base encoders and training only the projection heads, and training the entire model from scratch.
- **Evaluation metrics**: Model performance is measured with the contrastive loss and top-k% retrieval accuracy, complemented by qualitative and quantitative experimental analyses.

Through these methods, the paper shows how contrastive learning can associate astronomical observations with natural-language descriptions, providing a new text-based interface for interacting with astronomical data.
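The two evaluation metrics named above can be illustrated with a minimal NumPy sketch. This is not the paper's code. It assumes that row `i` of `image_emb` and row `i` of `text_emb` form a matched pair, uses the standard symmetric CLIP-style contrastive loss, and adopts one plausible definition of top-k% accuracy (the fraction of images whose matched text ranks within the top k% of all candidate texts by cosine similarity). The function names, the `temperature` default, and the `k_percent` parameter are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Assumes row i of image_emb matches row i of text_emb, so the correct
    "class" for each row and each column is the diagonal entry.
    """
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    diag = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)


def top_k_percent_accuracy(image_emb, text_emb, k_percent=10.0):
    """Fraction of images whose matched text is in the top k% of texts.

    This is an assumed definition for illustration; the paper's exact
    metric may differ in detail.
    """
    sim = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    n = sim.shape[0]
    k = max(1, int(np.ceil(n * k_percent / 100.0)))
    true_sim = np.diag(sim)[:, None]
    # Rank 0 means no other text scores strictly higher than the match.
    rank = (sim > true_sim).sum(axis=1)
    return float((rank < k).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(32, 16))
    # Synthetic "text" embeddings: noisy copies of the image embeddings.
    texts = images + 0.1 * rng.normal(size=(32, 16))
    print("loss:", clip_contrastive_loss(images, texts))
    print("top-10% accuracy:", top_k_percent_accuracy(images, texts, 10.0))
```

In this sketch, well-aligned embeddings give a low loss and a high retrieval accuracy, while mismatched pairs give the opposite. The same similarity matrix also serves image retrieval (rank rows for a text query) and description retrieval (rank columns for an image).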

PAPERCLIP: Associating Astronomical Observations and Natural Language with Multi-Modal Models

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging

CLIP2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data

Linking Representations with Multimodal Contrastive Learning

Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment

DiffCLIP: Few-shot Language-driven Multimodal Classifier

DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents

Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

Contrastive Localized Language-Image Pre-Training

Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation

Finetuning CLIP to Reason about Pairwise Differences

Training CLIP models on Data from Scientific Papers

AstroCLIP: a cross-modal foundation model for galaxies

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Multi-Modal Adapter for Vision-Language Models

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

Multilingual Vision-Language Pre-training for the Remote Sensing Domain

Mining Open Semantics from CLIP: A Relation Transition Perspective for Few-Shot Learning