Abstract: Recent works such as VisProg and ViperGPT have smartly composed foundation models for visual reasoning, using large language models (LLMs) to produce programs that can be executed by pre-trained vision-language models. However, they operate in limited domains, such as 2D images, not fully exploiting the generalization of language: abstract concepts like "left" can also be grounded in 3D, temporal, and action data, as in moving to your left. This limited generalization stems from these inference-only methods' inability to learn or adapt pre-trained models to a new domain. We propose the Logic-Enhanced Foundation Model (LEFT), a unified framework that learns to ground and reason with concepts across domains with a differentiable, domain-independent, first-order logic-based program executor. LEFT has an LLM interpreter that outputs a program represented in a general, logic-based reasoning language, which is shared across all domains and tasks. LEFT's executor then executes the program with trainable domain-specific grounding modules. We show that LEFT flexibly learns concepts in four domains: 2D images, 3D scenes, human motions, and robotic manipulation. It exhibits strong reasoning ability in a wide variety of tasks, including those that are complex and not seen during training, and can be easily applied to new domains.

What problem does this paper attempt to address?

The paper aims to provide a general framework for concept learning and reasoning across domains. Existing visual reasoning systems such as VisProg and ViperGPT use large language models (LLMs) to generate programs and execute them with pre-trained vision-language models, but they operate mainly on 2D images and do not extend well to domains such as 3D scenes, temporal data, human motions, or robotic manipulation. The main reason is that these methods are inference-only: they cannot learn new concept groundings or adapt pre-trained models to a new domain.

To address this, the authors propose the Logic-Enhanced Foundation Model (LEFT), a unified framework that learns to ground and reason with concepts across domains. Its key innovations are:

1. **Cross-domain learning and reasoning**: LEFT learns what an abstract concept (such as "left") means in each domain and reasons with it, covering 2D images, 3D scenes, human motions, and robotic manipulation.
2. **Trainable concept grounding modules**: Unlike existing methods that rely only on pre-trained models at inference time, LEFT contains trainable grounding modules that learn domain-specific representations of concepts.
3. **Differentiable first-order logic executor**: LEFT executes the logic programs generated by the LLM with a differentiable first-order logic executor, so the executor and its grounding modules can be trained end-to-end through backpropagation and can adapt and generalize across domains.

With these components, LEFT performs well in multiple domains and generalizes zero-shot to complex tasks not seen during training, without requiring a separate program language to be defined for each domain.
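To make the idea of a domain-independent logic executor with swappable groundings concrete, here is a minimal sketch in plain Python. It is not LEFT's implementation: the tuple-based program format, the `execute` function, the soft-logic choices (product for "and", max for "exists", min for "forall"), and all scores are assumptions made for illustration. In a trained system the concept scores would come from learned, domain-specific grounding modules, and the same operations would be written in an autodiff framework so gradients can flow back into those modules.

```python
# Hypothetical soft first-order logic executor (illustrative only).
# A program is a nested tuple; concept scores are probabilities in [0, 1].

def lookup(table, indices):
    """Index a nested list with one index per predicate argument."""
    for i in indices:
        table = table[i]
    return float(table)

def execute(expr, scene, env=None):
    """Evaluate a logic program against a scene and return a soft truth value."""
    env = env or {}
    op = expr[0]
    if op in ("exists", "forall"):
        _, var, body = expr
        scores = [execute(body, scene, {**env, var: i})
                  for i in range(scene["num_objects"])]
        # max / min act as soft existential / universal quantifiers
        return max(scores) if op == "exists" else min(scores)
    if op == "and":
        # product t-norm as a soft conjunction
        return execute(expr[1], scene, env) * execute(expr[2], scene, env)
    if op == "not":
        return 1.0 - execute(expr[1], scene, env)
    # Otherwise op names a concept; its scores come from the domain's grounding.
    return lookup(scene["concepts"][op], [env[v] for v in expr[1:]])

# "Is there a red object to the left of a cube?"
program = ("exists", "x",
           ("and", ("red", "x"),
                   ("exists", "y",
                    ("and", ("cube", "y"), ("left", "x", "y")))))

# Made-up grounding scores for a three-object 2D scene.
scene_2d = {
    "num_objects": 3,
    "concepts": {
        "red":  [0.9, 0.1, 0.2],
        "cube": [0.1, 0.8, 0.3],
        # left[i][j]: score that object i is to the left of object j
        "left": [[0.0, 0.9, 0.5],
                 [0.1, 0.0, 0.7],
                 [0.2, 0.3, 0.0]],
    },
}

score = execute(program, scene_2d)
print(round(score, 3))  # 0.648
```

The program itself never mentions the domain. Running the same `program` on a 3D scene or a motion sequence would only require a different `concepts` table, which is the role the summary attributes to the trainable grounding modules.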

What's Left? Concept Grounding with Logic-Enhanced Foundation Models

Grounding Language Plans in Demonstrations Through Counterfactual Perturbations

Proof of Thought : Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning

Beyond LLMs: Advancing the Landscape of Complex Reasoning

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

Large Language Models are Visual Reasoning Coordinators

Beyond Logic Programming for Legal Reasoning

Zero, Finite, and Infinite Belief History of Theory of Mind Reasoning in Large Language Models

Coupling Large Language Models with Logic Programming for Robust and General Reasoning from Text

Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models

Training Large Language Models to Reason in a Continuous Latent Space

Leveraging LLMs for Hypothetical Deduction in Logical Inference: A Neuro-Symbolic Approach

Thought-Like-Pro: Enhancing Reasoning of Large Language Models through Self-Driven Prolog-based Chain-of-Thought

Position: Foundation Agents as the Paradigm Shift for Decision Making

Grounding Large Language Models In Embodied Environment With Imperfect World Models

CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge

Can LLMs Reason in the Wild with Programs?

From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control

Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning