StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs

Hailin Chen, Fangkai Jiao, Mathieu Ravaut, Nawshad Farruque, Xuan Phi Nguyen, Chengwei Qin, Manan Dey, Bosheng Ding, Caiming Xiong, Shafiq Joty, Yingbo Zhou
2024-12-24
Abstract: The rapid development of large language models (LLMs) necessitates robust, unbiased, and scalable methods for evaluating their capabilities. However, human annotations are expensive to scale, model-based evaluations are prone to biases in answer style, while target-answer-based benchmarks are vulnerable to data contamination and cheating. To address these limitations, we propose StructTest, a novel benchmark that evaluates LLMs on their ability to produce compositionally specified structured outputs as an unbiased, cheap-to-run and difficult-to-cheat measure. The evaluation is done deterministically by a rule-based evaluator, which can be easily extended to new tasks. By testing structured outputs across diverse task domains -- including Summarization, Code, HTML and Math -- we demonstrate that StructTest serves as a good proxy for general reasoning abilities, as producing structured outputs often requires internal logical reasoning. We believe that StructTest offers a critical, complementary approach to objective and robust model evaluation.
Computation and Language
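
To make the idea of a deterministic, rule-based evaluator concrete, the sketch below checks a model output against a compositional format instruction. It is an illustration only and is not taken from the StructTest codebase: the function name `check_bullet_summary`, the instruction it encodes ("exactly N bullet points, each starting with '- ' and containing at most K words"), and the sample strings are all assumptions made for this example.

```python
import re


def check_bullet_summary(output: str, num_points: int, max_words: int) -> bool:
    """Hypothetical rule-based check (not the paper's evaluator).

    Passes only if `output` consists of exactly `num_points` non-empty lines,
    each of the form "- <text>" with at most `max_words` words of text.
    """
    # Ignore blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in output.strip().splitlines() if ln.strip()]
    if len(lines) != num_points:
        return False
    for ln in lines:
        match = re.fullmatch(r"- (.+)", ln)
        if match is None:
            return False  # line is not a "- " bullet
        if len(match.group(1).split()) > max_words:
            return False  # bullet exceeds the word budget
    return True


# Made-up outputs for illustration.
good = "- LLM evaluation is costly\n- Rule checks are cheap\n- Formats resist contamination"
numbered = "1. LLM evaluation is costly\n2. Rule checks are cheap\n3. Formats resist contamination"

print(check_bullet_summary(good, num_points=3, max_words=5))
print(check_bullet_summary(numbered, num_points=3, max_words=5))
```

Because a check like this depends only on the format of the output and not on a fixed target answer, the same rule can be applied to any input text, which is the property the abstract describes as cheap to run and difficult to cheat.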