Abstract: Against the backdrop of enthusiasm for large language models (LLMs), there is an urgent need to scientifically assess their capabilities and shortcomings. This is nontrivial, in part because it is difficult to find tasks that the models have not encountered during training. Using symbolic graphics programs, we propose a domain well suited to testing multiple spatial-semantic reasoning skills of LLMs. Popular in computer graphics, these programs procedurally generate visual data. While LLMs exhibit impressive skills in general program synthesis and analysis, symbolic graphics programs offer a new layer of evaluation: they allow us to test an LLM's ability to answer semantic-level questions, at different levels of granularity, about the images or 3D geometries without a vision encoder. To understand the symbolic programs semantically, LLMs need to "imagine" and reason about how the corresponding graphics content would look given only the symbolic description. We use this task to evaluate LLMs by creating a large benchmark for the semantic visual understanding of symbolic graphics programs, built procedurally with minimal human effort. Particular emphasis is placed on transformations of images that leave the image-level semantics invariant while introducing significant changes to the underlying program. We evaluate commercial and open-source LLMs on our benchmark to assess their ability to reason about the visual output of programs, finding that LLMs considered stronger at reasoning generally perform better. Lastly, we introduce a novel method to improve this ability: Symbolic Instruction Tuning (SIT), in which the LLM is finetuned with pre-collected instruction data on symbolic graphics programs. Interestingly, we find that SIT not only improves LLMs' understanding of symbolic programs but also improves general reasoning ability on various other benchmarks.
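To make the idea of a semantics-preserving transformation concrete, here is a minimal sketch in Python. It rests on assumptions the abstract does not state: the program format is taken to be SVG-like, the transformation is a plain translation, and the names `Circle`, `to_svg`, `translate`, and `relative_layout` are hypothetical helpers, not part of the benchmark. The sketch shows that shifting every shape rewrites every coordinate in the program text while leaving the relative arrangement of shapes (a rough stand-in for image-level semantics) unchanged.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Circle:
    """Hypothetical primitive of a toy symbolic graphics program."""
    cx: float
    cy: float
    r: float
    fill: str


def to_svg(shapes, size=100):
    """Serialize the shapes as an SVG-like program string (assumed format)."""
    body = "".join(
        f'<circle cx="{s.cx:g}" cy="{s.cy:g}" r="{s.r:g}" fill="{s.fill}"/>'
        for s in shapes
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{size}" height="{size}">{body}</svg>'
    )


def translate(shapes, dx, dy):
    """Shift every shape by (dx, dy); every coordinate in the program changes."""
    return [replace(s, cx=s.cx + dx, cy=s.cy + dy) for s in shapes]


def relative_layout(shapes):
    """Offsets, radii, and fills relative to the first shape.

    This is a crude proxy for image-level semantics: it is unaffected
    by translating the whole figure.
    """
    ref = shapes[0]
    return [(s.cx - ref.cx, s.cy - ref.cy, s.r, s.fill) for s in shapes]


# Illustrative figure: two stacked circles (a made-up example, not benchmark data).
figure = [Circle(50, 70, 20, "white"), Circle(50, 40, 12, "white")]
moved = translate(figure, 10, -5)

original_program = to_svg(figure)
moved_program = to_svg(moved)

print(original_program)
print(moved_program)
print(relative_layout(figure) == relative_layout(moved))
```

A question such as "how many circles are stacked?" has the same answer for both program strings, even though the two strings differ in every coordinate.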

Controlling Equational Reasoning in Large Language Models with Prompt Interventions

Evaluating Interventional Reasoning Capabilities of Large Language Models

What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

Hint Marginalization for Improved Reasoning in Large Language Models

MathPrompter: Mathematical Reasoning using Large Language Models

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models

Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models

Investigating Symbolic Capabilities of Large Language Models

Neuro-Symbolic Data Generation for Math Reasoning

Reasoning with Large Language Models, a Survey

FG-PRM: Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning

Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models

Sources of Hallucination by Large Language Models on Inference Tasks

Large Language Models Are Unconscious of Unreasonability in Math Problems

Can Large Language Models Understand Symbolic Graphics Programs?

Rational Metareasoning for Large Language Models

Look Before You Leap: Problem Elaboration Prompting Improves Mathematical Reasoning in Large Language Models

Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From A Psychological Perspective

Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning

Prompt Selection and Augmentation for Few Examples Code Generation in Large Language Model and its Application in Robotics Control