Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi

2024-06-13

Abstract: Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors, particularly in temporal reasoning tasks involving complex temporal logic. Existing research has explored LLM performance on temporal reasoning using diverse datasets and benchmarks. However, these studies often rely on real-world data that LLMs may have encountered during pre-training or employ anonymization techniques that can inadvertently introduce factual inconsistencies. In this work, we address these limitations by introducing novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios. The diversity of question types across these datasets enables systematic investigation into the impact of the problem structure, size, question type, fact order, and other factors on LLM performance. Our findings provide valuable insights into the strengths and weaknesses of current LLMs in temporal reasoning tasks. To foster further research in this area, we are open-sourcing the datasets and evaluation framework used in our experiments: https://huggingface.co/datasets/baharef/ToT.

Computation and Language

What problem does this paper attempt to address?

This paper focuses on how reliably large language models (LLMs) perform on temporal reasoning tasks. Existing studies of LLM temporal reasoning often rely on real-world data that the models may have encountered during pre-training, or use anonymization techniques that can introduce factual inconsistencies. The paper instead evaluates temporal reasoning in a more comprehensive and controlled way, using purpose-built synthetic datasets. These datasets contain a variety of question types and are designed to assess two skills independently: understanding temporal semantics and performing temporal arithmetic. The paper also argues that existing benchmarks focus too heavily on temporal facts drawn from knowledge graphs and overlook the more complex temporal structures and reasoning tasks found in real-world applications. Through experiments on the new tasks, the paper provides insights into the strengths and weaknesses of current LLMs in temporal reasoning, and it open-sources the datasets and evaluation framework to promote further research.
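To make the idea of synthetic temporal questions concrete, here is a minimal Python sketch. It is not the authors' generator: the function names, the anonymized labels (E0, R1, E99), and the question templates are all hypothetical, chosen only to illustrate how a question can be built so that its ground-truth answer is known by construction and cannot have been seen during pre-training.

```python
import random
from datetime import date, timedelta


def make_arithmetic_question(rng):
    """Build one date-arithmetic question whose answer is computed, not looked up.

    Hypothetical template; the real benchmark's wording may differ.
    """
    start = date(2000, 1, 1) + timedelta(days=rng.randrange(0, 3650))
    offset = rng.randrange(1, 400)
    answer = start + timedelta(days=offset)
    question = f"What is the date {offset} days after {start.isoformat()}?"
    return question, answer.isoformat()


def make_semantic_question(rng, num_facts=4):
    """Build anonymized interval facts and ask who held a relation in a given year.

    Intervals are consecutive and non-overlapping, so the answer is unique.
    Entity and relation names (E0.., R1, E99) are placeholders, not real data.
    """
    facts = []
    year = 1990
    for i in range(num_facts):
        length = rng.randrange(1, 6)                      # interval lasts 1 to 5 years
        facts.append((f"E{i}", year, year + length - 1))  # inclusive end year
        year += length

    target, first_year, last_year = rng.choice(facts)
    query_year = rng.randrange(first_year, last_year + 1)

    # Shuffle the presentation order so that fact order can be varied as a factor.
    shuffled = facts[:]
    rng.shuffle(shuffled)
    lines = [f"{e} was the R1 of E99 from {s} to {t}." for e, s, t in shuffled]
    question = " ".join(lines) + f" Who was the R1 of E99 in {query_year}?"
    return question, target


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the generated examples are reproducible
    for make in (make_arithmetic_question, make_semantic_question):
        q, a = make(rng)
        print(q)
        print("answer:", a)
```

Because every fact is generated rather than scraped, properties such as the number of facts or their order can be varied independently, which is the kind of controlled comparison the summary describes.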

LTLBench: Towards Benchmarks for Evaluating Temporal Logic Reasoning in Large Language Models

TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models

Remember This Event That Year? Assessing Temporal Information and Reasoning in Large Language Models

Large Language Models Can Learn Temporal Reasoning

MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Language Models

Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models

A Picture is Worth A Thousand Numbers: Enabling LLMs Reason about Time Series via Visualization

Enhancing Temporal Understanding in LLMs for Semi-structured Tables

Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning?

Back to the Future: Towards Explainable Temporal Reasoning with Large Language Models

TRAM: Benchmarking Temporal Reasoning for Large Language Models

A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting

Is Your LLM Outdated? Evaluating LLMs at Temporal Generalization

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Benchmarking Large Language Models for Math Reasoning Tasks

STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Can LLMs Reason in the Wild with Programs?

Timo: Towards Better Temporal Reasoning for Language Models

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations