Abstract: In recent years, several machine learning models have been proposed. They are trained with a language modelling objective on large-scale text-only data. With such pretraining, they can achieve impressive results on many Natural Language Understanding and Generation tasks. However, many facets of meaning cannot be learned by "listening to the radio" only. In the literature, many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality. In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field. We rely on Wittgenstein's idea of "language games" to categorise such tasks into 3 different families: 1) discriminative games, 2) generative games, and 3) interactive games. Our analysis of the literature provides evidence that future work should be focusing on interactive games, where communication in Natural Language is important to resolve ambiguities about object referents and action plans, and that physical embodiment is essential to understand the semantics of situations and events. Overall, these represent key requirements for developing grounded meanings in neural models.

### What problem does this paper attempt to solve?

This paper addresses the problem of symbol grounding in Vision+Language (V+L) tasks. Through a systematic literature review, it explores how to create machine learning models that can ground symbolic meaning in the visual modality.

#### Main Objectives:

1. **Propose a new classification method**: Introduce a new classification for vision-grounded language games based on the required skills and abilities.
2. **Dataset analysis**: Analyze 50 relevant datasets proposed over the past 20 years.
3. **Model evaluation**: Analyze 51 recently proposed V+L models.
4. **Future research directions**: Propose research questions related to grounded language learning to guide future V+L research.

#### Research Background:

- Traditionally, most machine learning models rely solely on large-scale text data for training. While they have achieved significant results in natural language understanding and generation tasks, they cannot fully comprehend the deep meaning of language.
- The meaning of symbols comes not only from the text itself but also requires multimodal perceptual experiences (such as vision, function, smell, etc.).
- To achieve true natural language understanding, it is necessary to integrate visual information into the models, enabling them to understand the meaning of symbols in specific contexts.

#### Main Contributions:

- Proposed a classification method based on Wittgenstein's concept of "language games," dividing tasks into three categories: discriminative games, generative games, and interactive games.
- Conducted a systematic analysis of existing datasets, covering various types of tasks and environmental characteristics.
- Emphasized that future research should focus on interactive games, where natural language communication is crucial for resolving ambiguities in object reference and action planning, and physical embodiment is a key requirement for understanding situational and event semantics.

Through this work, the paper offers a theoretical foundation and guidance for future research in the Vision+Language field.

Visually Grounded Language Learning: a review of language games, datasets, tasks, and models

An Introduction to Vision-Language Modeling

Vision-Language Models for Vision Tasks: A Survey

Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions

Can 3D Vision-Language Models Truly Understand Natural Language?

Large Language Models and Video Games: A Preliminary Scoping Review

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

Analyzing the Roles of Language and Vision in Learning from Limited Data

Vision-Language Models in Remote Sensing: Current progress and future trends

Large Language Models and Games: A Survey and Roadmap

A Survey on Vision-Language-Action Models for Embodied AI

Large Language Models: The Need for Nuance in Current Debates and a Pragmatic Perspective on Understanding

Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

Learning Visual Grounding from Generative Vision and Language Model

A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?

From Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design

A Vision Check-up for Language Models

The Vector Grounding Problem