Abstract: This paper presents $\forall$uto$\exists$$\lor\!\land$L, a novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness, such as truth maintenance in translation and logical reasoning. $\forall$uto$\exists$$\lor\!\land$L is the first benchmarking paradigm that offers several key advantages necessary for scaling objective evaluation of LLMs without human labeling: (a) ability to evaluate LLMs of increasing sophistication by auto-generating tasks at different levels of difficulty; (b) auto-generation of ground truth that eliminates dependence on expensive and time-consuming human annotation; (c) the use of automatically generated, randomized datasets that mitigate the ability of successive LLMs to overfit to static datasets used in many contemporary benchmarks. Empirical analysis shows that an LLM's performance on $\forall$uto$\exists$$\lor\!\land$L is highly indicative of its performance on a diverse array of other benchmarks focusing on translation and reasoning tasks, making it a valuable autonomous evaluation paradigm in settings where hand-curated datasets can be hard to obtain and/or update.

What problem does this paper attempt to address?

The paper addresses how to evaluate the correctness-maintaining and reasoning abilities of large language models (LLMs) in formal tasks, especially in translation between natural language (NL) and formal syntax (FS). Specifically, it poses three key questions:

1. **Ability to Dynamically Generate Datasets** (D1): Can an out-of-distribution dataset be generated dynamically without relying on manual annotation?
2. **Accurately Evaluate the Correctness-Maintaining Ability of LLMs** (D2): How can the correctness-maintaining ability of an LLM in translation and reasoning tasks be evaluated accurately?
3. **Predict the Performance of LLMs in Other Tasks** (D3): Can the paper's evaluation metrics serve as predictors of LLM performance in FS-based tasks?

### Detailed Explanation

#### 1. Ability to Dynamically Generate Datasets (D1)

Existing benchmarks usually rely on static datasets. Models can overfit to such datasets, which undermines the objectivity and reliability of the evaluation. The paper proposes a method that uses context-free grammars (CFGs) to automatically generate balanced, out-of-distribution datasets, so that the test items are unlikely to have been seen or memorized by an LLM during training.

#### 2. Accurately Evaluate the Correctness-Maintaining Ability of LLMs (D2)

To evaluate the correctness-maintaining ability of LLMs in translation and reasoning tasks, the paper introduces a closed-loop test and uses a formal verifier to score LLM output automatically. This avoids reliance on expensive and time-consuming manual annotation while keeping the evaluation objective.

#### 3. Predict the Performance of LLMs in Other Tasks (D3)

The paper shows that its evaluation metrics can serve as predictors of LLM performance in other tasks, such as first-order logical reasoning. It supports this claim by comparing its results against multiple existing benchmarks.

### Main Contributions of the Paper

- Proposes a dynamic method for automatically generating balanced test datasets that are unlikely to have been seen or memorized during LLM training.
- Uses formal verifiers such as theorem provers to check correctness independently of syntax, without exhaustively testing all possible truth values of the formal-syntax expressions.
- Introduces a scalable evaluation system, ∀uto∃∨∧L, for evaluating newly developed LLMs.
- Shows that LLM performance on its metrics can be an effective indicator of performance in other tasks, especially when no new datasets are available.

### Conclusion

By proposing the benchmarking framework ∀uto∃∨∧L, the paper addresses the overfitting problem of static datasets in existing evaluation methods and provides an automated, objective way to evaluate the correctness-maintaining and reasoning abilities of LLMs in formal tasks.
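The grammar-based generation (D1) and closed-loop check (D2) summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy propositional grammar, the function names (`generate`, `equivalent`, `closed_loop_ok`, `drop_negation`), and the use of callables in place of real LLM calls. Equivalence is checked here by truth-table enumeration, a stand-in that only scales to tiny propositional formulas; the paper instead relies on formal verifiers such as theorem provers.

```python
import itertools
import random

# Hypothetical toy grammar for propositional logic (not the paper's grammar):
#   S -> ("and", S, S) | ("or", S, S) | ("not", S) | variable name
VARIABLES = ["p1", "p2", "p3"]


def generate(depth, rng):
    """Sample a random formula with at most `depth` nested operators."""
    if depth == 0 or rng.random() < 0.2:
        return rng.choice(VARIABLES)
    op = rng.choice(["and", "or", "not"])
    if op == "not":
        return ("not", generate(depth - 1, rng))
    return (op, generate(depth - 1, rng), generate(depth - 1, rng))


def evaluate(formula, assignment):
    """Evaluate a formula under a dict mapping variable names to booleans."""
    if isinstance(formula, str):
        return assignment[formula]
    if formula[0] == "not":
        return not evaluate(formula[1], assignment)
    left = evaluate(formula[1], assignment)
    right = evaluate(formula[2], assignment)
    return (left and right) if formula[0] == "and" else (left or right)


def equivalent(f, g):
    """Truth-table equivalence; a stand-in for a theorem prover."""
    for values in itertools.product([False, True], repeat=len(VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        if evaluate(f, assignment) != evaluate(g, assignment):
            return False
    return True


def closed_loop_ok(formula, to_nl, to_fs):
    """Round trip FS -> NL -> FS; pass if the recovered formula is equivalent.

    `to_nl` and `to_fs` are placeholders for the two LLM translation calls.
    """
    return equivalent(formula, to_fs(to_nl(formula)))


def drop_negation(formula):
    """Hypothetical lossy translator that silently removes every negation."""
    if isinstance(formula, str):
        return formula
    if formula[0] == "not":
        return drop_negation(formula[1])
    return (formula[0], drop_negation(formula[1]), drop_negation(formula[2]))


def identity(x):
    """Stands in for a perfectly faithful translation step."""
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    dataset = [generate(3, rng) for _ in range(5)]
    for formula in dataset:
        faithful = closed_loop_ok(formula, identity, identity)
        lossy = closed_loop_ok(formula, identity, drop_negation)
        print(formula, "| faithful:", faithful, "| lossy:", lossy)
```

Because the check is semantic rather than textual, a round trip that returns a syntactically different but logically equivalent formula still passes, while one that changes the meaning (for example by dropping a negation) fails. No human-written reference translation is needed in either case.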

$\forall$uto$\exists$$\lor\!\land$L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks

$\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems

LLMs for Relational Reasoning: How Far are We?

Automated Theorem Provers Help Improve Large Language Model Reasoning

Easy Problems That LLMs Get Wrong

An Evaluation Benchmark for Autoformalization in Lean4

Beyond LLMs: Advancing the Landscape of Complex Reasoning

$\texttt{ACCORD}$: Closing the Commonsense Measurability Gap

LTLBench: Towards Benchmarks for Evaluating Temporal Logic Reasoning in Large Language Models

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning

Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic

A Closer Look at Logical Reasoning with LLMs: The Choice of Tool Matters

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models