TRAM: Benchmarking Temporal Reasoning for Large Language Models

Yuqing Wang, Yun Zhao
2024-05-31
Abstract: Reasoning about time is essential for understanding the nuances of events described in natural language. Previous research on this topic has been limited in scope, characterized by a lack of standardized benchmarks that would allow for consistent evaluations across different studies. In this paper, we introduce TRAM, a temporal reasoning (TeR) benchmark composed of ten datasets, encompassing various temporal aspects of events such as order, arithmetic, frequency, and duration, designed to facilitate a comprehensive evaluation of the TeR capabilities of large language models (LLMs). We evaluate popular LLMs like GPT-4 and Llama2 in zero-shot and few-shot scenarios, and establish baselines with BERT-based and domain-specific models. Our findings indicate that the best-performing model lags significantly behind human performance. It is our aspiration that TRAM will spur further progress in enhancing the TeR capabilities of LLMs.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the evaluation of temporal reasoning (TeR) in natural language processing. Prior work lacks a unified benchmark, which makes results from different studies difficult to compare. To fill this gap, the authors propose TRAM (Temporal Reasoning for large Language Models), a benchmark of ten temporal reasoning tasks designed to comprehensively evaluate the TeR capabilities of large language models (LLMs). The tasks cover event order, frequency, duration, arithmetic, and causal relationships. Using this benchmark, the authors evaluate popular LLMs, including GPT-4 and Llama2, in zero-shot and few-shot settings, and establish baselines with BERT-based and domain-specific models. The results show that although GPT-4 performs well on most tasks, it still lags significantly behind human performance, leaving considerable room for improvement in LLMs' temporal reasoning. The paper also analyzes where models struggle, namely in distinguishing subtle differences and in parsing implicit temporal cues, and highlights the need for further research on these aspects. The goal of TRAM is to encourage the development of LLMs that understand and reason about temporal information at a level closer to human performance.
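The difference between the zero-shot and few-shot settings mentioned above can be illustrated with a minimal sketch. It assumes a multiple-choice question format scored by accuracy; the `Item` class, the prompt layout, and the two sample questions are hypothetical illustrations and are not taken from TRAM or from the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question (hypothetical format, not TRAM's schema)."""
    question: str
    options: list[str]
    answer: str  # gold option letter, e.g. "B"


def format_item(item: Item, with_answer: bool) -> str:
    """Render a question with lettered options; optionally append the gold answer."""
    lines = [f"Question: {item.question}"]
    for letter, option in zip("ABCD", item.options):
        lines.append(f"{letter}. {option}")
    lines.append(f"Answer: {item.answer}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target: Item, shots: list[Item]) -> str:
    """Zero-shot when `shots` is empty; few-shot when solved examples are prepended."""
    blocks = [format_item(shot, with_answer=True) for shot in shots]
    blocks.append(format_item(target, with_answer=False))
    return "\n\n".join(blocks)


def accuracy(predictions: list[str], items: list[Item]) -> float:
    """Fraction of predictions whose first character matches the gold letter."""
    if not items:
        return 0.0
    correct = sum(
        pred.strip().upper()[:1] == item.answer
        for pred, item in zip(predictions, items)
    )
    return correct / len(items)


# Made-up sample questions, for illustration only.
duration_item = Item(
    question="How long does it typically take to brush your teeth?",
    options=["2 minutes", "2 hours", "2 days"],
    answer="A",
)
arithmetic_item = Item(
    question="What time is it 45 minutes after 10:30?",
    options=["11:00", "11:15", "11:45"],
    answer="B",
)

zero_shot_prompt = build_prompt(arithmetic_item, shots=[])
few_shot_prompt = build_prompt(arithmetic_item, shots=[duration_item])
```

In this sketch the only difference between the two settings is whether solved examples precede the target question; the model's reply would then be compared with the gold letter using `accuracy`.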