Abstract: Large language models (LLMs) are becoming attractive as few-shot reasoners for solving natural language (NL) tasks. However, their capability to process structured data such as tables remains under-explored. While tables can be serialized as input for LLMs, there is a lack of comprehensive studies on whether LLMs genuinely comprehend this data. In this paper, we try to understand this by designing a benchmark to evaluate the structural understanding capabilities of LLMs through seven distinct tasks, e.g., cell lookup, row retrieval, and size detection. Specifically, we perform a series of evaluations on the most advanced recent LLMs, GPT-3.5 and GPT-4, and observe that performance varies with different input choices, including table input format, content order, role prompting, and partition marks. Drawing on the insights gained through the benchmark evaluations, we propose $\textit{self-augmentation}$ for effective structural prompting, such as critical value / range identification using the internal knowledge of LLMs. When combined with carefully chosen input choices, these structural prompting methods lead to promising improvements in LLM performance on a variety of tabular tasks, e.g., TabFact ($\uparrow2.31\%$), HybridQA ($\uparrow2.13\%$), SQA ($\uparrow2.72\%$), Feverous ($\uparrow0.84\%$), and ToTTo ($\uparrow5.68\%$). We believe that our open-source benchmark and proposed prompting methods can serve as a simple yet generic choice for future research. The code and data of this paper will be temporarily released at https://anonymous.4open.science/r/StructuredLLM-76F3/README.md and will be replaced with an official release at https://github.com/microsoft/TableProvider later.
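
As a rough illustration of the setup the abstract describes, the sketch below serializes a small table as text, wraps it in partition marks with a role prompt, and computes ground-truth answers for two of the structural tasks (size detection and cell lookup). It is a minimal sketch under stated assumptions, not the benchmark's actual code: the function names, the pipe-delimited format, the `<table>` / `</table>` partition marks, and the role-prompt wording are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def serialize_markdown(header: List[str], rows: List[List[str]]) -> str:
    """Serialize a table as pipe-delimited text, one row per line.

    Assumption: this is only one possible input format among several.
    """
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("|" + "|".join(" --- " for _ in header) + "|")
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


def build_prompt(
    table_text: str,
    question: str,
    role: str = "You are a helpful assistant that reads tables.",  # hypothetical role prompt
    open_mark: str = "<table>",    # hypothetical partition mark
    close_mark: str = "</table>",  # hypothetical partition mark
) -> str:
    """Combine role prompt, partition-marked table, and question into one prompt."""
    return f"{role}\n{open_mark}\n{table_text}\n{close_mark}\nQuestion: {question}"


def size_detection_answer(header: List[str], rows: List[List[str]]) -> Tuple[int, int]:
    """Ground truth for size detection: (number of data rows, number of columns)."""
    return (len(rows), len(header))


def cell_lookup_answer(rows: List[List[str]], row_idx: int, col_idx: int) -> str:
    """Ground truth for cell lookup, using 0-based indices into the data rows."""
    return rows[row_idx][col_idx]


if __name__ == "__main__":
    header = ["Name", "Year"]
    rows = [["Ada", "1815"], ["Alan", "1912"]]
    prompt = build_prompt(
        serialize_markdown(header, rows),
        "How many rows and columns does the table have?",
    )
    print(prompt)
    print(size_detection_answer(header, rows))
```

A model's reply to such a prompt could then be compared against the computed ground truth; swapping the serializer, the partition marks, or the role prompt gives a simple way to vary the input choices the abstract mentions.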

StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs

StrucText-Eval: Evaluating Large Language Model's Reasoning Ability in Structure-Rich Text

Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation

A NotSo Simple Way to Beat Simple Bench

Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks

MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs

TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks

Evaluating Mathematical Reasoning Beyond Accuracy

Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

Struct-X: Enhancing Large Language Models Reasoning with Structured Data

LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Reasoning Factual Knowledge in Structured Data with Large Language Models

Structured Chemistry Reasoning with Large Language Models

CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations