Intra-agent speech permits zero-shot task acquisition

Chen Yan, Federico Carnevale, Petko Georgiev, Adam Santoro, Aurelia Guy, Alistair Muldal, Chia-Chun Hung, Josh Abramson, Timothy Lillicrap, Gregory Wayne
DOI: https://doi.org/10.48550/arXiv.2206.03139
2022-06-07
Abstract: Human language learners are exposed to a trickle of informative, context-sensitive language, but a flood of raw sensory data. Through both social language use and internal processes of rehearsal and practice, language learners are able to build high-level, semantic representations that explain their perceptions. Here, we take inspiration from such processes of "inner speech" in humans (Vygotsky, 1934) to better understand the role of intra-agent speech in embodied behavior. First, we formally pose intra-agent speech as a semi-supervised problem and develop two algorithms that enable visually grounded captioning with little labeled language data. We then experimentally compute scaling curves over different amounts of labeled data and compare the data efficiency against a supervised learning baseline. Finally, we incorporate intra-agent speech into an embodied, mobile manipulator agent operating in a 3D virtual world, and show that with as few as 150 additional image captions, intra-agent speech endows the agent with the ability to manipulate and answer questions about a new object without any related task-directed experience (zero-shot). Taken together, our experiments suggest that modelling intra-agent speech is effective in enabling embodied agents to learn new tasks efficiently and without direct interaction experience.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper asks how agents can learn new tasks efficiently by modeling intra-agent speech, especially when those tasks involve objects the agent has never interacted with. Specifically, the researchers explore two questions:

1. **Can intra-agent speech be modeled as a machine-learning problem**, so that abundant unlabeled data compensates for scarce language data?
2. **Does the mechanism for learning intra-agent speech affect the subsequent behavior of an embodied agent?**

### Main Contributions

- **A semi-supervised learning method**: The researchers propose two algorithms, one generative and one contrastive, for visually grounded caption generation. Both require only a small amount of labeled language data.
- **Zero-shot ability**: With a small number of additional image captions (as few as 150), the agent can manipulate and answer questions about a new object without any related task-directed experience.
- **Effect of intra-agent speech on behavior**: Intra-agent speech not only helps the agent describe what it perceives, but also significantly improves its performance on tasks involving new objects.

### Research Background

Human language learners face a large amount of raw sensory data and relatively little social language input. Through social language use and internal rehearsal and practice, humans construct high-level semantic representations that explain their perceptions. Inspired by this, the researchers study the role of intra-agent speech in embodied behavior and apply it to artificial agents.

### Experimental Design

1. **Semi-supervised Caption Generation**: The researchers first tested their method in the Playhouse environment, a 3D virtual world that provides a large number of image frames but no labeled captions. Through crowdsourcing, they obtained captions for 78K images to form a "paired" dataset.
2. **Learning of New Objects**: To evaluate the model's ability to learn new objects, they selected a specific object (a drum) and removed every instance involving it from the training data. They then reintroduced different amounts of labeled drum data and measured the model's performance at each amount.
3. **Zero-shot Task Execution**: Finally, they tested whether an agent that had learned only the language describing the drum, without ever interacting with the drum, could complete tasks involving it, such as picking up the drum or answering questions about its color.

### Results

Compared with a baseline that uses only supervised learning, the semi-supervised methods perform better on multiple metrics, especially when labeled data is scarce. On tasks involving the new object, an agent that has learned the language describing that object can execute the tasks successfully with almost no direct interaction experience.

In conclusion, the paper shows that by modeling intra-agent speech, an agent can quickly learn to describe new objects from only a small amount of labeled data, and that this ability translates into a behavioral advantage: zero-shot task execution.
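The held-out-object protocol described under Experimental Design (remove every caption involving the drum, then add back a graded number of them) can be sketched as follows. This is a minimal illustration, not the authors' pipeline. The function name `build_paired_sets`, the dictionary layout of each example, the whole-word caption match, and every budget value other than 150 are assumptions made for the example.

```python
import random


def build_paired_sets(paired, held_out="drum", budgets=(0, 150, 500), seed=0):
    """Build one labeled training set per caption budget.

    Hypothetical helper: `paired` is a list of dicts with a "caption" key.
    Captions that mention `held_out` as a whole word are removed, and then
    `budget` of them are added back to form each training set.
    """
    rng = random.Random(seed)

    def mentions(example):
        # Simplification: exact whole-word match, so plurals are not caught.
        return held_out in example["caption"].lower().split()

    with_obj = [ex for ex in paired if mentions(ex)]
    without_obj = [ex for ex in paired if not mentions(ex)]
    rng.shuffle(with_obj)  # pick the reintroduced captions at random

    # Slicing past the end is safe: a budget larger than the pool uses all of it.
    return {b: without_obj + with_obj[:b] for b in budgets}


# Toy stand-in for the crowdsourced paired dataset (assumed layout).
toy = [
    {
        "image_id": i,
        "caption": "a red drum on the floor" if i % 4 == 0 else "a blue chair near the bed",
    }
    for i in range(40)
]

sets = build_paired_sets(toy, budgets=(0, 5, 10))
print({b: len(s) for b, s in sets.items()})  # {0: 30, 5: 35, 10: 40}
```

Training one captioning model per budget and plotting a metric against the budget would give a scaling curve of the kind the abstract mentions. The zero budget corresponds to a model that has never seen the held-out object described.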