Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks

Sherzod Hakimov, David Schlangen
DOI: https://doi.org/10.48550/arXiv.2305.13782
2023-05-23
Computation and Language
Abstract: Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. While being actively researched, multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models. In this work, we ask whether language-only models can be utilised for tasks that require visual input -- but also, as we argue, often require a strong reasoning component. Similar to some recent related work, we make visual information accessible to the language model using separate verbalisation models. Specifically, we investigate the performance of open-source, open-access language models against GPT-3 on five vision-language tasks when given textually-encoded visual information. Our results suggest that language models are effective for solving vision-language tasks even with limited samples. This approach also enhances the interpretability of a model's output by providing a means of tracing the output back through the verbalised image content.
What problem does this paper attempt to address?
This paper asks how large language models (LLMs) trained only on text can be used for multimodal tasks, specifically those that combine vision and language. The authors explore whether such models can handle tasks that require visual input when the image content is first converted into textual descriptions. Beyond task performance, the approach is intended to make the model's output more interpretable, because an answer can be traced back through the verbalised image content that produced it.
The core of the paper is an evaluation of LLMs on five vision-language tasks, four classification tasks and one question-answering task, when the models are given text-encoded visual information. The study also examines how different methods of generating the text descriptions affect performance, and compares open-source, open-access models with GPT-3 on the selected tasks.
In short, the paper attempts to answer the following key questions:
1. Can large language models effectively solve vision-language tasks using text-encoded visual information?
2. How do different text description generation methods affect the model's performance on these tasks?
3. How does the performance of open-source, open-access language models differ from that of GPT-3 on vision-language tasks?
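To make the general idea concrete, here is a minimal Python sketch of how verbalised image content could be packed into a few-shot text prompt for a language-only model. It is an illustration only and not the paper's actual prompt template: the names `Example`, `format_example`, and `build_prompt`, the field labels ("Image description", "Input", "Answer"), and the sample captions are all hypothetical. The captions are assumed to come from a separate verbalisation model (for instance an image captioner), which is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One task instance with the image already verbalised (hypothetical structure)."""
    caption: str      # textual description of the image, produced by a separate model
    text: str         # textual part of the input, e.g. a question or a claim
    label: str = ""   # gold answer; left empty for the instance to be predicted


def format_example(ex: Example) -> str:
    # The field labels are an assumed template, not the one used in the paper.
    block = f"Image description: {ex.caption}\nInput: {ex.text}\nAnswer: {ex.label}"
    # For the query the label is empty, so drop the trailing space after "Answer:".
    return block.rstrip()


def build_prompt(instruction: str, shots: List[Example], query: Example) -> str:
    """Join the instruction, the labelled few-shot examples, and the unlabelled query."""
    parts = [instruction]
    parts.extend(format_example(s) for s in shots)
    parts.append(format_example(query))
    return "\n\n".join(parts)


if __name__ == "__main__":
    shots = [Example("a dog lying on a sofa", "What animal is shown?", "dog")]
    query = Example("two cats sleeping on a bed", "How many animals are there?")
    # The resulting string would be sent to any text-only language model.
    print(build_prompt("Answer the question using the image description.", shots, query))
```

Because the model only ever sees the prompt text, a wrong answer can be checked against the image description that was supplied, which is the kind of traceability the summary refers to.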