Abstract: Constructing a timeline requires identifying the chronological order of events in an article. In prior timeline construction datasets, temporal orders are typically annotated by either event-to-time anchoring or event-to-event pairwise ordering, both of which suffer from missing temporal information. To mitigate the issue, we develop a new evaluation dataset, TimeSET, consisting of single-document timelines with document-level order annotation. TimeSET features saliency-based event selection and partial ordering, which enable a practical annotation workload. Aiming to build better automatic timeline construction systems, we propose a novel evaluation framework to compare multiple task formulations with TimeSET by prompting open LLMs, i.e., Llama 2 and Flan-T5. Considering that identifying temporal orders of events is a core subtask in timeline construction, we further benchmark open LLMs on existing event temporal ordering datasets to gain a robust understanding of their capabilities. Our experiments show that (1) NLI formulation with Flan-T5 demonstrates a strong performance among others, while (2) timeline construction and event temporal ordering are still challenging tasks for few-shot LLMs. Our code and data are available at

What problem does this paper attempt to address?

The paper addresses the problem of identifying temporal order for timeline construction: extracting events from a text and arranging them in the order in which they actually occurred. Prior timeline construction datasets typically annotate temporal order either by anchoring events to time points or by pairwise ordering between events, and both approaches suffer from missing temporal information. To alleviate this, the authors develop a new evaluation dataset, TimeSET, which contains single-document timelines with document-level partial-order annotations. They also propose an evaluation framework that compares large language models (such as Llama 2 and Flan-T5) on timeline construction under different task formulations, aiming to find which formulation best elicits the models' capabilities.

### Main contributions of the paper:

1. **Development of the TimeSET dataset**: A new evaluation dataset that supports context-based timeline construction and is publicly available under an open license.
2. **Proposing a new evaluation framework**: The framework enables comparisons across models and task formulations. Experimental results show that the natural language inference (NLI) formulation with Flan-T5 performs strongly on the timeline construction task.
3. **Benchmarking**: The authors benchmark open large language models on existing event temporal ordering datasets and find that few-shot large language models perform worse than smaller fine-tuned models on some tasks.

### Key technical details:

- **Dataset characteristics**: TimeSET contains articles from Wikinews. Each article has event selection and partial-order annotation at the document level. Event selection is based on salience, and partial-order annotation reduces the annotation workload.
- **Task formulations**: The paper compares four task formulations: natural language inference (NLI), pairwise ordering (Pairwise), machine reading comprehension (MRC), and timeline (Timeline). Each formulation has its own prompt template, adapted to the input format that the language models expect.
- **Model selection**: The study uses two model families, Llama 2 and Flan-T5, which cover different architectures and model sizes so that the results apply more broadly.

### Experimental results:

- **Combination of NLI formulation and Flan-T5**: Performs best on the timeline construction task.
- **Influence of model size**: Larger models generally perform better across task formulations, and the effect is most pronounced in the Flan-T5 family.
- **Influence of document length**: Performance decreases on longer documents, most noticeably for formulations that must identify multiple orders at once (such as the Timeline formulation).

### Conclusion:

Timeline construction remains a challenging task, especially for few-shot large language models. The authors hope that their evaluation framework and the TimeSET dataset will promote future research in this field.
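To make the four task formulations more concrete, the following is a minimal Python sketch of how one partial-order timeline could be turned into prompts for each of them. Everything in it is an assumption for illustration: the layered timeline representation, the function names, and the prompt wording are hypothetical and are not the paper's actual data format or templates.

```python
from itertools import combinations

# Toy partial-order timeline (assumed representation, not the TimeSET format):
# events in the same inner list are left unordered relative to each other,
# and every event in an earlier list precedes every event in a later list.
TIMELINE = [
    ["earthquake strikes"],
    ["rescue teams arrive", "aid is pledged"],
    ["survivors are found"],
]


def ordered_pairs(timeline):
    """Expand a layered timeline into (earlier, later) event pairs."""
    pairs = []
    for i, j in combinations(range(len(timeline)), 2):
        for earlier in timeline[i]:
            for later in timeline[j]:
                pairs.append((earlier, later))
    return pairs


def nli_prompt(document, e1, e2):
    """NLI style: is the hypothesis 'e1 before e2' entailed by the document?"""
    return (
        f"Premise: {document}\n"
        f"Hypothesis: '{e1}' happened before '{e2}'.\n"
        "Answer (entailment or contradiction):"
    )


def pairwise_prompt(document, e1, e2):
    """Pairwise style: classify the relation between two events."""
    return (
        f"Document: {document}\n"
        f"Question: What is the temporal relation between '{e1}' and '{e2}'?\n"
        "Options: before, after\n"
        "Answer:"
    )


def mrc_prompt(document, target):
    """MRC style: ask which events precede a target event."""
    return (
        f"Document: {document}\n"
        f"Question: Which events happened before '{target}'?\n"
        "Answer:"
    )


def timeline_prompt(document, events):
    """Timeline style: ask for the full ordering in a single generation."""
    listing = "\n".join(f"- {event}" for event in events)
    return (
        f"Document: {document}\n"
        "Order the following events chronologically:\n"
        f"{listing}\n"
        "Timeline:"
    )


if __name__ == "__main__":
    doc = "An earthquake struck; rescue teams arrived and aid was pledged."
    for earlier, later in ordered_pairs(TIMELINE):
        print(nli_prompt(doc, earlier, later))
        print()
```

The sketch also hints at why the formulations behave differently: the NLI and Pairwise styles issue one short query per event pair, whereas the Timeline style must produce every order in one output, which is consistent with the summary's observation that this formulation degrades more on longer documents.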

Formulation Comparison for Timeline Construction using LLMs

TLEX: An Efficient Method for Extracting Exact Timelines from TimeML Temporal Graphs

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

LITA: Language Instructed Temporal-Localization Assistant

Are Large Language Models Temporally Grounded?

A Temporally Sensitive Submodularity Framework for Timeline Summarization

A Picture is Worth A Thousand Numbers: Enabling LLMs Reason about Time Series via Visualization

A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting

Timo: Towards Better Temporal Reasoning for Language Models

Back to the Future: Towards Explainable Temporal Reasoning with Large Language Models

Remember This Event That Year? Assessing Temporal Information and Reasoning in Large Language Models

TempCompass: Do Video LLMs Really Understand Videos?

TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability

Joint Inference for Event Timeline Construction

TIMELINE: Exhaustive Annotation of Temporal Relations Supporting the Automatic Ordering of Events in News Articles

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

Improve Temporal Awareness of LLMs for Sequential Recommendation

Chain of History: Learning and Forecasting with LLMs for Temporal Knowledge Graph Completion

DTELS: Towards Dynamic Granularity of Timeline Summarization

Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning

ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning