Formulation Comparison for Timeline Construction using LLMs

Kimihiro Hasegawa, Nikhil Kandukuri, Susan Holm, Yukari Yamakawa, Teruko Mitamura
2024-03-02
Abstract: Constructing a timeline requires identifying the chronological order of events in an article. In prior timeline construction datasets, temporal orders are typically annotated by either event-to-time anchoring or event-to-event pairwise ordering, both of which suffer from missing temporal information. To mitigate the issue, we develop a new evaluation dataset, TimeSET, consisting of single-document timelines with document-level order annotation. TimeSET features saliency-based event selection and partial ordering, which enable a practical annotation workload. Aiming to build better automatic timeline construction systems, we propose a novel evaluation framework to compare multiple task formulations with TimeSET by prompting open LLMs, i.e., Llama 2 and Flan-T5. Considering that identifying temporal orders of events is a core subtask in timeline construction, we further benchmark open LLMs on existing event temporal ordering datasets to gain a robust understanding of their capabilities. Our experiments show that (1) NLI formulation with Flan-T5 demonstrates a strong performance among others, while (2) timeline construction and event temporal ordering are still challenging tasks for few-shot LLMs. Our code and data are available at
Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is identifying the temporal order of events in timeline construction. Specifically, the authors focus on how to extract events from text and arrange them in the order in which they actually occurred. In prior work, timeline construction datasets were usually annotated either by anchoring events to time points or by pairwise ordering between events, and both approaches suffer from missing temporal information. To mitigate this, the authors developed a new evaluation dataset, TimeSET, which consists of single-document timelines with document-level partial-order annotation. They also propose a new evaluation framework that compares task formulations for timeline construction by prompting open large language models (Llama 2 and Flan-T5), with the aim of finding which formulation best elicits the models' capabilities.

### Main contributions of the paper:

1. **Development of the TimeSET dataset**: A new evaluation dataset that supports context-based timeline construction and is publicly available under an open license.
2. **A new evaluation framework**: The framework enables comparisons across models and task formulations. Experimental results show that the natural language inference (NLI) formulation with Flan-T5 performs strongly on the timeline construction task.
3. **Benchmarking**: The authors benchmark open large language models on existing event temporal ordering datasets and find that few-shot large language models perform worse than smaller fine-tuned models on some tasks.

### Key technical details:

- **Dataset characteristics**: TimeSET contains articles from Wikinews. Each article is annotated at the document level with event selection and partial ordering. Event selection is based on saliency, and partial ordering keeps the annotation workload practical.
- **Task formulations**: The paper compares four task formulations: natural language inference (NLI), pairwise ordering (Pairwise), machine reading comprehension (MRC), and timeline (Timeline). Each formulation uses its own prompt template, adapted to the input format each model expects.
- **Model selection**: The study uses two model families, Llama 2 and Flan-T5, covering different architectures and model sizes so that the results apply broadly.

### Experimental results:

- **NLI formulation with Flan-T5**: This combination performs best on the timeline construction task.
- **Influence of model size**: Larger models generally perform better across task formulations, and the trend is most pronounced in the Flan-T5 family.
- **Influence of document length**: Performance drops on longer documents, most noticeably for formulations that must identify multiple orders at once, such as the timeline formulation.

### Conclusion:

Timeline construction remains a challenging task, especially for few-shot large language models. The authors hope that their evaluation framework and the TimeSET dataset will promote future research in this field.
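
To make the relationship between a partially ordered timeline and the pairwise and NLI formulations more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the list-of-levels representation of a partial order, the example events, the helper names `pairwise_relations` and `nli_prompt`, and the prompt wording are all hypothetical and are not taken from the paper or from TimeSET.

```python
from itertools import combinations

# Hypothetical timeline: a list of "levels". Events inside the same level are
# treated as unordered relative to each other, which gives a partial order.
timeline = [
    ["earthquake strikes"],
    ["rescue teams deployed", "airport closed"],
    ["aid arrives"],
]


def pairwise_relations(levels):
    """Yield (earlier, later) event pairs implied by the level ordering.

    Pairs within a single level are skipped because their relative
    order is not annotated in this representation.
    """
    for i, j in combinations(range(len(levels)), 2):
        for earlier in levels[i]:
            for later in levels[j]:
                yield (earlier, later)


def nli_prompt(document, earlier, later):
    """Build an illustrative NLI-style prompt (the wording is an assumption)."""
    return (
        f"Premise: {document}\n"
        f'Hypothesis: "{earlier}" happened before "{later}".\n'
        "Does the premise entail the hypothesis? Answer yes or no."
    )


pairs = list(pairwise_relations(timeline))
example = nli_prompt("(article text goes here)", *pairs[0])
```

In this sketch a document-level annotation expands into one pairwise instance per ordered pair, so a formulation that judges one pair at a time needs many more model calls than one that asks for the whole timeline in a single prompt.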