Visually Grounded Language Learning: a review of language games, datasets, tasks, and models

Alessandro Suglia, Ioannis Konstas, Oliver Lemon
2023-12-05
Abstract: In recent years, several machine learning models have been proposed. They are trained with a language modelling objective on large-scale text-only data. With such pretraining, they can achieve impressive results on many Natural Language Understanding and Generation tasks. However, many facets of meaning cannot be learned by "listening to the radio" only. In the literature, many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality. In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field. We rely on Wittgenstein's idea of 'language games' to categorise such tasks into 3 different families: 1) discriminative games, 2) generative games, and 3) interactive games. Our analysis of the literature provides evidence that future work should be focusing on interactive games where communication in Natural Language is important to resolve ambiguities about object referents and action plans and that physical embodiment is essential to understand the semantics of situations and events. Overall, these represent key requirements for developing grounded meanings in neural models.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to address the issue of symbol grounding in Vision+Language (V+L) tasks and explores how to create machine learning models that can combine symbolic meaning with visual modalities through a systematic literature review.

#### Main Objectives:

1. **Propose a new classification method**: Introduce a new classification for vision-grounded language games based on the required skills and abilities.
2. **Dataset analysis**: Analyze 50 relevant datasets proposed over the past 20 years.
3. **Model evaluation**: Analyze 51 recently proposed V+L models.
4. **Future research directions**: Propose research questions related to grounded language learning to guide future V+L research.

#### Research Background:

- Traditionally, most machine learning models rely solely on large-scale text data for training. While they have achieved significant results in natural language understanding and generation tasks, they cannot fully comprehend the deep meaning of language.
- The meaning of symbols comes not only from the text itself but also requires multimodal perceptual experiences (such as vision, function, smell, etc.).
- To achieve true natural language understanding, it is necessary to integrate visual information into the models, enabling them to understand the meaning of symbols in specific contexts.

#### Main Contributions:

- Proposed a classification method based on Wittgenstein's concept of "language games," dividing tasks into three categories: discriminative games, generative games, and interactive games.
- Conducted a systematic analysis of existing datasets, covering various types of tasks and environmental characteristics.
- Emphasized that future research should focus on interactive games, where natural language communication is crucial for resolving ambiguities in object reference and action planning, and physical embodiment is a key requirement for understanding situational and event semantics.
Together, the taxonomy, the dataset and model analyses, and the proposed research questions give future Vision+Language research a conceptual framework and concrete directions.