Abstract: Large Language Models (LLMs) have recently demonstrated a remarkable ability to model time series data. These capabilities can be partly explained if LLMs understand basic time series concepts. However, our knowledge of what these models understand about time series data remains relatively limited. To address this gap, we introduce TimeSeriesExam, a configurable and scalable multiple-choice question exam designed to assess LLMs across five core time series understanding categories: pattern recognition, noise understanding, similarity analysis, anomaly detection, and causality analysis. TimeSeriesExam comprises over 700 questions, procedurally generated using 104 carefully curated templates and iteratively refined to balance difficulty and their ability to discriminate good from bad models. We test 7 state-of-the-art LLMs on the TimeSeriesExam and provide the first comprehensive evaluation of their time series understanding abilities. Our results suggest that closed-source models such as GPT-4 and Gemini understand simple time series concepts significantly better than their open-source counterparts, while all models struggle with complex concepts such as causality analysis. We believe that the ability to programmatically generate questions is fundamental to assessing and improving LLMs' ability to understand and reason about time series data.

What problem does this paper attempt to address?

This paper addresses the question of how well large language models (LLMs) understand the basic concepts of time series data. Although LLMs perform well at modeling time series, little is currently known about how much of the underlying concepts they actually understand. To this end, the authors introduce **TimeSeriesExam**, a configurable and scalable multiple-choice exam that evaluates LLMs across five core categories of time series understanding: pattern recognition, noise understanding, similarity analysis, anomaly detection, and causality analysis.

Specifically, the goals of the paper are to:

1. **Design an evaluation tool**: Construct **TimeSeriesExam**, which contains more than 700 questions procedurally generated from 104 carefully curated templates and iteratively refined to balance difficulty against the ability to discriminate good models from bad ones.
2. **Evaluate the performance of existing models**: Test 7 state-of-the-art LLMs, both closed-source and open-source, to provide the first comprehensive evaluation of their time series understanding.
3. **Reveal the ability gap between models**: The results suggest that closed-source models such as GPT-4 and Gemini understand simple time series concepts significantly better than their open-source counterparts, while all models struggle with complex concepts such as causality analysis.
4. **Propose directions for improvement**: Emphasize that programmatically generating questions is fundamental to assessing and improving the ability of LLMs to understand and reason about time series data.

Through these goals, the paper aims to fill the current gap in evaluating how well LLMs understand time series and to provide directions for future research and development.
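To make the idea of procedurally generating multiple-choice questions from a template more concrete, here is a minimal sketch in Python. It is not the paper's generator or any of its 104 templates: the `Question` class, the `trend_question` template, the slope and noise values, and the `half_mean_baseline` predictor are all hypothetical names and choices made for illustration. The sketch only shows the general pattern of sampling a synthetic series with a known property, deriving the correct option from that property, and scoring a predictor by accuracy.

```python
import random
from dataclasses import dataclass


@dataclass
class Question:
    """One multiple-choice item (hypothetical structure, not the paper's schema)."""
    series: list[float]
    prompt: str
    options: list[str]
    answer: str  # text of the correct option


def make_series(n: int, slope: float, noise_std: float, rng: random.Random) -> list[float]:
    """Synthetic series: linear trend plus Gaussian noise."""
    return [slope * t + rng.gauss(0.0, noise_std) for t in range(n)]


def trend_question(rng: random.Random, n: int = 50, noise_std: float = 0.5) -> Question:
    """Hypothetical template: the ground truth is known because we chose the slope."""
    label, slope = rng.choice(
        [("increasing", 0.3), ("decreasing", -0.3), ("no trend", 0.0)]
    )
    options = ["increasing", "decreasing", "no trend"]
    rng.shuffle(options)  # randomize option order so position carries no signal
    return Question(
        series=make_series(n, slope, noise_std, rng),
        prompt="What is the overall trend of this time series?",
        options=options,
        answer=label,
    )


def generate_exam(num_questions: int, seed: int = 0) -> list[Question]:
    """Generate a reproducible exam from the single template above."""
    rng = random.Random(seed)
    return [trend_question(rng) for _ in range(num_questions)]


def half_mean_baseline(q: Question) -> str:
    """Toy predictor standing in for a model: compare the means of the two halves."""
    half = len(q.series) // 2
    first = sum(q.series[:half]) / half
    second = sum(q.series[half:]) / (len(q.series) - half)
    diff = second - first
    if abs(diff) < 1.0:  # illustrative threshold, an assumption
        return "no trend"
    return "increasing" if diff > 0 else "decreasing"


def accuracy(exam: list[Question], predict) -> float:
    """Fraction of questions for which the predictor picks the correct option."""
    return sum(predict(q) == q.answer for q in exam) / len(exam)


if __name__ == "__main__":
    exam = generate_exam(20, seed=1)
    print(f"baseline accuracy: {accuracy(exam, half_mean_baseline):.2f}")
```

Because every question comes from a seeded generator, the exam can be regenerated at any size or difficulty (for example by raising `noise_std`), which is the property the summary highlights as useful for evaluation. In a real evaluation, `predict` would wrap a call to an LLM rather than a hand-written rule.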

TimeSeriesExam: A time series understanding exam
