What's Left? Concept Grounding with Logic-Enhanced Foundation Models

Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Jiajun Wu
2023-10-25
Abstract: Recent works such as VisProg and ViperGPT have smartly composed foundation models for visual reasoning, using large language models (LLMs) to produce programs that can be executed by pre-trained vision-language models. However, they operate in limited domains, such as 2D images, not fully exploiting the generalization of language: abstract concepts like "left" can also be grounded in 3D, temporal, and action data, as in moving to your left. This limited generalization stems from these inference-only methods' inability to learn or adapt pre-trained models to a new domain. We propose the Logic-Enhanced Foundation Model (LEFT), a unified framework that learns to ground and reason with concepts across domains with a differentiable, domain-independent, first-order logic-based program executor. LEFT has an LLM interpreter that outputs a program represented in a general, logic-based reasoning language, which is shared across all domains and tasks. LEFT's executor then executes the program with trainable domain-specific grounding modules. We show that LEFT flexibly learns concepts in four domains: 2D images, 3D scenes, human motions, and robotic manipulation. It exhibits strong reasoning ability in a wide variety of tasks, including those that are complex and not seen during training, and can be easily applied to new domains.
Artificial Intelligence, Machine Learning, Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper aims to build a general framework for concept learning and reasoning across domains. Existing visual reasoning systems such as VisProg and ViperGPT use large language models (LLMs) to generate programs, which pre-trained vision-language models then execute. However, these systems are largely limited to 2D images and do not carry over to 3D scenes, temporal data, human motions, or robotic manipulation. The main reason is that they are inference-only: they cannot learn or adapt pre-trained models to a new domain.

To address this, the authors propose the Logic-Enhanced Foundation Model (LEFT), a unified framework that learns to ground and reason with concepts across domains. Its key contributions are:

1. **Cross-domain learning and reasoning**: LEFT learns what an abstract concept (such as "left") means in each domain and reasons with it, covering 2D images, 3D scenes, human motions, and robotic manipulation.
2. **Trainable concept grounding modules**: Unlike existing methods that rely only on pre-trained models at inference time, LEFT contains trainable, domain-specific grounding modules that learn how each concept is represented in a given domain.
3. **Differentiable first-order logic executor**: LEFT executes the logic programs generated by the LLM with a differentiable, domain-independent first-order logic executor, so the grounding modules can be trained end-to-end through back-propagation and adapt to new domains.

With these components, LEFT performs well across the four domains and generalizes zero-shot to complex tasks not seen during training, without requiring a separate program definition or language for each domain.
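To make the idea of a differentiable logic executor with a trainable grounding module more concrete, here is a minimal sketch in plain Python. It is not LEFT's implementation: the scene, the one-parameter `left` module, the choice of product for AND and max for EXISTS, and the finite-difference gradient (a stand-in for back-propagation) are all assumptions made for illustration. The program being executed corresponds to the question "Is there a red object to the left of a blue object?"

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Soft versions of logic operators acting on probabilities in [0, 1].
# Product for AND and max for EXISTS are illustrative assumptions.
def soft_and(*ps):
    out = 1.0
    for p in ps:
        out *= p
    return out

def soft_exists(ps):
    return max(ps)

# Hypothetical 2D scene: an x-position plus color scores of the kind a
# pre-trained vision model might supply.
SCENE = [
    {"x": 0.1, "red": 0.9, "blue": 0.1},
    {"x": 0.8, "red": 0.1, "blue": 0.9},
]

def left(a, b, w):
    """Hypothetical trainable grounding of left(a, b) with one weight w."""
    return sigmoid(w * (b["x"] - a["x"]))

def execute(scene, w):
    """Evaluate: exists x, y. red(x) and blue(y) and left(x, y)."""
    return soft_exists([
        soft_and(a["red"], b["blue"], left(a, b, w))
        for a in scene for b in scene if a is not b
    ])

def loss(w):
    # Negative log-likelihood of the answer "yes" for SCENE.
    return -math.log(execute(SCENE, w))

def train(w=0.0, lr=1.0, steps=200, eps=1e-5):
    for _ in range(steps):
        # Central finite difference, standing in for back-propagation.
        grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)
        w -= lr * grad
    return w

w_trained = train()
before = execute(SCENE, 0.0)       # untrained: left() is uninformative
after = execute(SCENE, w_trained)  # trained: left() prefers smaller x
print(before, after, w_trained)
```

Because the supervision here is only the final yes/no answer, the `left` module is learned through the executor rather than from direct labels on object pairs, which is the property the differentiable executor is meant to provide. A real system would use an automatic differentiation library and far richer grounding modules.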