Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks

Sherzod Hakimov, David Schlangen

DOI: https://doi.org/10.48550/arXiv.2305.13782

2023-05-23

Computation and Language

Abstract: Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. While being actively researched, multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models. In this work, we ask whether language-only models can be utilised for tasks that require visual input -- but also, as we argue, often require a strong reasoning component. Similar to some recent related work, we make visual information accessible to the language model using separate verbalisation models. Specifically, we investigate the performance of open-source, open-access language models against GPT-3 on five vision-language tasks when given textually-encoded visual information. Our results suggest that language models are effective for solving vision-language tasks even with limited samples. This approach also enhances the interpretability of a model's output by providing a means of tracing the output back through the verbalised image content.

What problem does this paper attempt to address?

This paper asks whether large language models (LLMs) trained only on text can be used for multimodal tasks, specifically tasks that combine vision and language. The authors test whether such models can handle tasks that require visual input when the image is first converted into a textual description by a separate verbalisation model. Besides task performance, the approach is meant to improve interpretability: because the model only sees the verbalised image content, its output can be traced back through that text.

The core of the paper is an evaluation of LLMs on five vision-language tasks, four classification tasks and one question-answering task, when given text-encoded visual information. The study also examines how different methods of generating the textual descriptions affect performance, and compares open-source, open-access models with GPT-3 on the selected tasks.

In short, the paper attempts to answer the following key questions:

1. Can large language models effectively solve vision-language tasks when visual information is provided as text?
2. How do different methods of generating the textual descriptions affect model performance on these tasks?
3. How does the performance of open-source, open-access language models differ from that of GPT-3 on vision-language tasks?
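The idea of passing verbalised image content to a language-only model in a few-shot setting can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Example` class, the `build_few_shot_prompt` function, the `Image:`/`Text:`/`Answer:` prompt layout, and the sample captions are hypothetical and are not taken from the paper. The sketch also assumes the captions have already been produced by some verbalisation model, and it stops at building the prompt string rather than calling any particular LLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One labelled sample; all field names are hypothetical."""
    caption: str  # textual description of the image (assumed precomputed)
    text: str     # accompanying text, e.g. a question about the image
    label: str    # gold answer or class label


def build_few_shot_prompt(
    instruction: str,
    shots: List[Example],
    query_caption: str,
    query_text: str,
) -> str:
    """Assemble a text-only prompt from verbalised images.

    Each in-context example contributes its caption, text, and label;
    the query is appended last with the answer left open for the model.
    """
    blocks = [instruction.strip()]
    for shot in shots:
        blocks.append(
            f"Image: {shot.caption}\nText: {shot.text}\nAnswer: {shot.label}"
        )
    blocks.append(f"Image: {query_caption}\nText: {query_text}\nAnswer:")
    return "\n\n".join(blocks)


# Illustrative data only; not drawn from the paper's datasets.
shots = [
    Example("a dog catching a frisbee in a park", "Is the animal indoors?", "no"),
    Example("a cat sleeping on a sofa", "Is the animal indoors?", "yes"),
]
prompt = build_few_shot_prompt(
    "Answer the question using the image description.",
    shots,
    query_caption="a horse standing in a stable",
    query_text="Is the animal indoors?",
)
print(prompt)
```

Because the prompt is plain text, the caption that led to a given answer can be inspected directly, which is the kind of traceability the summary above refers to.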
