Investigating Symbolic Capabilities of Large Language Models

Neisarg Dave, Daniel Kifer, C. Lee Giles, Ankur Mali
2024-05-22
Abstract: Prompting techniques have significantly enhanced the capabilities of Large Language Models (LLMs) across various complex tasks, including reasoning, planning, and solving math word problems. However, most research has predominantly focused on language-based reasoning and word problems, often overlooking the potential of LLMs in handling symbol-based calculations and reasoning. This study aims to bridge this gap by rigorously evaluating LLMs on a series of symbolic tasks, such as addition, multiplication, modulus arithmetic, numerical precision, and symbolic counting. Our analysis encompasses eight LLMs, including four enterprise-grade and four open-source models, of which three have been pre-trained on mathematical tasks. The assessment framework is anchored in Chomsky's Hierarchy, providing a robust measure of the computational abilities of these models. The evaluation employs minimally explained prompts alongside the zero-shot Chain of Thoughts technique, allowing models to navigate the solution process autonomously. The findings reveal a significant decline in LLMs' performance on context-free and context-sensitive symbolic tasks as the complexity, represented by the number of symbols, increases. Notably, even the fine-tuned GPT3.5 exhibits only marginal improvements, mirroring the performance trends observed in other models. Across the board, all models demonstrated a limited generalization ability on these symbol-intensive tasks. This research underscores LLMs' challenges with increasing symbolic complexity and highlights the need for specialized training, memory and architectural adjustments to enhance their proficiency in symbol-based reasoning tasks.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper evaluates how well large language models (LLMs) handle symbolic tasks: addition, multiplication, modulus arithmetic, numerical precision, and symbolic counting. The authors note that most existing research focuses on language-based reasoning and word problems, and that the ability of LLMs to perform symbol-based calculation and reasoning is often overlooked. To fill this gap, they run a series of experiments on eight LLMs (four enterprise-grade and four open-source), of which three have been pre-trained on mathematical tasks. The evaluation framework is anchored in Chomsky's hierarchy, which provides a measure of the models' computational abilities. The experiments use minimally explained prompts together with zero-shot chain-of-thought prompting, so that the models work through the solution process on their own. The results show that performance on context-free and context-sensitive symbolic tasks declines significantly as complexity, measured by the number of symbols, increases. Even the fine-tuned GPT3.5 improves only marginally and follows the same performance trend as the other models. Overall, all models generalize poorly on these symbol-intensive tasks. The authors conclude that improving proficiency in symbol-based reasoning will require specialized training, memory, and architectural adjustments.
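To make the idea of measuring performance against the number of symbols concrete, here is a minimal Python sketch of such an evaluation loop. It is an illustration only, not the authors' harness: the function names (`make_addition_task`, `make_modulus_task`, `make_counting_task`, `accuracy_by_size`), the prompt wording, and the exact-match scoring are all assumptions, and `model` stands for any hypothetical callable that takes a prompt string and returns only the final answer as a string.

```python
import random
from typing import Callable, Dict, List, Tuple

# A task generator maps (size, rng) to a (prompt, expected_answer) pair.
TaskMaker = Callable[[int, random.Random], Tuple[str, str]]


def make_addition_task(num_digits: int, rng: random.Random) -> Tuple[str, str]:
    """Addition of two random integers with exactly `num_digits` digits each."""
    low, high = 10 ** (num_digits - 1), 10 ** num_digits - 1
    a, b = rng.randint(low, high), rng.randint(low, high)
    return f"{a} + {b} =", str(a + b)


def make_modulus_task(num_digits: int, rng: random.Random) -> Tuple[str, str]:
    """Remainder of a `num_digits`-digit integer modulo a small number."""
    low, high = 10 ** (num_digits - 1), 10 ** num_digits - 1
    a, m = rng.randint(low, high), rng.randint(2, 97)
    return f"{a} mod {m} =", str(a % m)


def make_counting_task(length: int, rng: random.Random) -> Tuple[str, str]:
    """Count occurrences of 'a' in a random string over the alphabet {a, b}."""
    s = "".join(rng.choice("ab") for _ in range(length))
    return f"How many times does 'a' occur in {s}?", str(s.count("a"))


def accuracy_by_size(
    model: Callable[[str], str],  # hypothetical: prompt in, final answer out
    make_task: TaskMaker,
    sizes: List[int],
    trials: int = 20,
    seed: int = 0,
) -> Dict[int, float]:
    """Exact-match accuracy of `model` at each problem size."""
    rng = random.Random(seed)
    results: Dict[int, float] = {}
    for size in sizes:
        correct = 0
        for _ in range(trials):
            prompt, expected = make_task(size, rng)
            correct += int(model(prompt).strip() == expected)
        results[size] = correct / trials
    return results
```

Calling `accuracy_by_size(model, make_addition_task, [1, 5, 10, 20])` returns one accuracy value per operand length, so a decline like the one the paper reports would show up as values falling as the size grows. With chain-of-thought prompting, a real model returns its reasoning along with the answer, so an answer-extraction step would be needed before the exact-match comparison; that step is omitted here.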