Abstract: Temporal reasoning is a crucial NLP task, providing a nuanced understanding of time-sensitive contexts within textual data. Although recent advancements in LLMs have demonstrated their potential in temporal reasoning, the predominant focus has been on tasks such as temporal expression and temporal relation extraction. These tasks are primarily designed to extract direct, past temporal cues and involve only simple reasoning processes. A significant gap remains for complex reasoning tasks such as event forecasting, which requires multi-step temporal reasoning over events and prediction at a future timestamp. Another notable limitation of existing methods is their inability to illustrate their reasoning process, which hinders explainability. In this paper, we introduce the first task of explainable temporal reasoning: predicting an event's occurrence at a future timestamp based on context that requires multi-step reasoning over multiple events, and then providing a clear explanation for the prediction. Our task offers a comprehensive evaluation of LLMs' complex temporal reasoning ability, future event prediction ability, and explainability, a critical attribute for AI applications. To support this task, we present the first multi-source instruction-tuning dataset of explainable temporal reasoning (ExpTime), with 26k instances derived from temporal knowledge graph datasets and their temporal reasoning paths using a novel knowledge-graph-instructed-generation strategy. Based on the dataset, we propose the first open-source LLM series, TimeLlaMA, built on the foundation model LlaMA2, with instruction-following ability for explainable temporal reasoning. We compare the performance of our method with a variety of LLMs, and our method achieves state-of-the-art performance in temporal prediction and explanation.

TRAM: Benchmarking Temporal Reasoning for Large Language Models

TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models

LTLBench: Towards Benchmarks for Evaluating Temporal Logic Reasoning in Large Language Models

Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Large Language Models Can Learn Temporal Reasoning

MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Language Models

Timo: Towards Better Temporal Reasoning for Language Models

STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis

A Picture is Worth A Thousand Numbers: Enabling LLMs Reason about Time Series via Visualization

Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning?

Remember This Event That Year? Assessing Temporal Information and Reasoning in Large Language Models

Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning

Narrative-of-Thought: Improving Temporal Reasoning of Large Language Models via Recounted Narratives

TimeR4: Time-aware Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering

TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models

Back to the Future: Towards Explainable Temporal Reasoning with Large Language Models

ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos

A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting

Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding

TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs