Abstract: Numerical reasoning is often important for accurately understanding the world. Recently, several format-specific datasets have been proposed, covering numerical reasoning in the settings of Natural Language Inference (NLI), Reading Comprehension (RC), and Question Answering (QA). Several format-specific models and architectures have also been proposed in response to those datasets. However, there is a strong need for a benchmark that can evaluate the ability of models to perform question-format-independent numerical reasoning, because (i) the numerical reasoning capabilities we want to teach are not controlled by question formats, and (ii) for numerical reasoning technology to have the best possible application, it must be able to process language and reason in a way that is not exclusive to a single format, task, dataset, or domain. In pursuit of this goal, we introduce NUMBERGAME, a multifaceted benchmark to evaluate model performance across numerical reasoning tasks of eight diverse formats. We include four existing question types in our compilation. Two of the new types we add consist of questions that require external numerical knowledge, commonsense knowledge, and domain knowledge. To support building a more practical numerical reasoning system, NUMBERGAME demands four capabilities beyond numerical reasoning: (i) detecting the question format directly from data, (ii) finding an intermediate common format to which every format can be converted, (iii) incorporating commonsense knowledge, and (iv) handling data imbalance across formats. We build several baselines, including a new model based on knowledge hunting using a cheatsheet. However, all baselines perform poorly compared with the human baselines, indicating the hardness of our benchmark. Our work takes forward the recent progress in generic system development, demonstrating the scope of these under-explored tasks.

EQUATE: A Benchmark Evaluation Framework for Quantitative Reasoning in Natural Language Inference

Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data

Enhancing Quantitative Reasoning Skills of Large Language Models Through Dimension Perception

Solving Quantitative Reasoning Problems with Language Models

BizBench: A Quantitative Reasoning Benchmark for Business and Finance

QUITE: Quantifying Uncertainty in Natural Language Text in Bayesian Reasoning Scenarios

Pragmatic Reasoning Unlocks Quantifier Semantics for Foundation Models

Towards Question Format Independent Numerical Reasoning: A Set of Prerequisite Tasks

Reflection of Thought: Inversely Eliciting Numerical Reasoning in Language Models via Solving Linear Systems

NOAHQA: Numerical Reasoning with Interpretable Graph Question Answering Dataset

Numerical Reasoning for Financial Reports

Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data

Enhancing Numerical Reasoning with the Guidance of Reliable Reasoning Processes

Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap

Linguini: A benchmark for language-agnostic linguistic reasoning

MiQA: A Benchmark for Inference on Metaphorical Questions

Exploring Equation as a Better Intermediate Meaning Representation for Numerical Reasoning

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Evaluating Mathematical Reasoning Beyond Accuracy

Reasoning Elicitation in Language Models via Counterfactual Feedback

A quantitative study of NLP approaches to question difficulty estimation