Abstract: Large language models (LLMs) have significantly impacted many aspects of our lives. However, assessing and ensuring their chronological knowledge remains challenging. Existing approaches fall short in addressing the accumulative nature of knowledge, often relying on a single time stamp. To overcome this, we introduce ChroKnowBench, a benchmark dataset designed to evaluate chronologically accumulated knowledge across three key aspects: multiple domains, time dependency, and temporal state. Our benchmark distinguishes between knowledge that evolves (e.g., scientific discoveries, amended laws) and knowledge that remains constant (e.g., mathematical truths, commonsense facts). Building on this benchmark, we present ChroKnowledge (Chronological Categorization of Knowledge), a novel sampling-based framework for evaluating and updating LLMs' non-parametric chronological knowledge. Our evaluation shows: (1) The ability to elicit temporal knowledge varies depending on the data format that the model was trained on. (2) LLMs partially recall knowledge or show a cut-off at temporal boundaries rather than recalling all aspects of knowledge correctly. Thus, we apply our ChroKnowPrompt, an in-depth prompting strategy that elicits chronological knowledge by traversing step-by-step through the surrounding time spans. We observe that our framework successfully updates the overall knowledge across the entire timeline in both the biomedical domain (+11.9%) and the general domain (+2.8%), demonstrating its effectiveness in refining temporal knowledge. Because this approach is non-parametric, it enables knowledge updates in both open-source models and proprietary LLMs, ensuring comprehensive applicability across model types. We perform a comprehensive analysis based on temporal characteristics of ChroKnowPrompt and validate the potential of various models to elicit intrinsic temporal knowledge through our method.

Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time

Time-Aware Language Models as Temporal Knowledge Bases

TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Remember This Event That Year? Assessing Temporal Information and Reasoning in Large Language Models

Temporal Blind Spots in Large Language Models

Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models

Evaluating Large Language Models on Time Series Feature Understanding: A Comprehensive Taxonomy and Benchmark

Unveiling Factual Recall Behaviors of Large Language Models through Knowledge Neurons

TRAM: Benchmarking Temporal Reasoning for Large Language Models

RecallM: An Adaptable Memory Mechanism with Temporal Understanding for Large Language Models

Are Large Language Models Temporally Grounded?

Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding

Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning?

A Picture is Worth A Thousand Numbers: Enabling LLMs Reason about Time Series via Visualization

ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains

Revisited Large Language Model for Time Series Analysis through Modality Alignment

STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

It's About Time: Incorporating Temporality in Retrieval Augmented Language Models

Assessing the Reliability of Large Language Model Knowledge