Fine-Tuning a Time Series Foundation Model with Wasserstein Loss

Andrei Chernov
2024-09-19
Abstract: Inspired by recent advancements in large language models (LLMs) for Natural Language Processing (NLP), there has been a surge in research focused on developing foundational models for time series forecasting. One approach involves training LLM architectures on tokenized time series data using cross-entropy loss. Although this method has demonstrated promising results, cross-entropy loss is primarily designed for classification tasks and does not account for the distance between classes. To address this limitation, we propose using the Wasserstein loss for such architectures. To validate our approach, we fine-tuned a foundational time series model on $22$ zero-shot datasets, comparing the performance of cross-entropy loss with that of Wasserstein loss. Our results demonstrate that replacing cross-entropy loss with Wasserstein loss significantly improves point estimation.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses how to improve time series forecasting models that are built on large language model (LLM) architectures and trained on tokenized time series data. Cross-entropy loss, the usual training objective for such models, is designed for classification: it does not account for the distance between classes, which is a limitation when the classes correspond to tokenized numeric values and forecasting is effectively a regression task. To address this, the authors propose replacing cross-entropy loss with Wasserstein loss. In fine-tuning experiments on 22 zero-shot datasets, Wasserstein loss significantly improved point estimation compared with cross-entropy loss. However, it performed slightly worse than cross-entropy loss in probabilistic forecasting. Future research directions include training time series foundation models from scratch with Wasserstein loss and exploring more complex distribution assumptions to improve probabilistic forecasting performance.
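The difference between the two losses can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes the tokens are ordered, unit-spaced bins and a one-hot target, in which case the 1-Wasserstein distance reduces to the sum of absolute differences between the two cumulative distributions. The function names and the example distributions are hypothetical.

```python
import numpy as np

def cross_entropy(probs, target_bin):
    # Depends only on the probability assigned to the true bin.
    return -np.log(probs[target_bin])

def wasserstein_1d(probs, target_bin):
    # 1-Wasserstein distance between the predicted distribution and a
    # one-hot target, assuming ordered bins with unit spacing:
    # the sum of absolute differences between the two CDFs.
    target = np.zeros_like(probs)
    target[target_bin] = 1.0
    return np.abs(np.cumsum(probs) - np.cumsum(target)).sum()

target_bin = 1
# Both predictions put 0.2 on the true bin; they differ only in where
# the remaining 0.8 of probability mass is placed.
near = np.array([0.0, 0.2, 0.8, 0.0, 0.0, 0.0])  # mass one bin away
far = np.array([0.0, 0.2, 0.0, 0.0, 0.0, 0.8])   # mass four bins away

ce_near, ce_far = cross_entropy(near, target_bin), cross_entropy(far, target_bin)
w_near, w_far = wasserstein_1d(near, target_bin), wasserstein_1d(far, target_bin)

print(f"cross-entropy: near={ce_near:.3f}, far={ce_far:.3f}")
print(f"wasserstein:   near={w_near:.3f}, far={w_far:.3f}")
```

Under these assumptions, cross-entropy assigns both predictions the same loss, while the Wasserstein distance grows with how far the misplaced probability mass sits from the true bin.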