Hierarchical Multimodal LLMs with Semantic Space Alignment for Enhanced Time Series Classification

Xiaoyu Tao, Tingyue Pan, Mingyue Cheng, Yucong Luo
2024-10-24
Abstract: Leveraging large language models (LLMs) has garnered increasing attention and introduced novel perspectives in time series classification. However, existing approaches often overlook the crucial dynamic temporal information inherent in time series data and face challenges in aligning this data with textual semantics. To address these limitations, we propose HiTime, a hierarchical multi-modal model that seamlessly integrates temporal information into LLMs for multivariate time series classification (MTSC). Our model employs a hierarchical feature encoder to capture diverse aspects of time series data through both data-specific and task-specific embeddings. To facilitate semantic space alignment between time series and text, we introduce a dual-view contrastive alignment module that bridges the gap between modalities. Additionally, we adopt a hybrid prompting strategy to fine-tune the pre-trained LLM in a parameter-efficient manner. By effectively incorporating dynamic temporal features and ensuring semantic alignment, HiTime enables LLMs to process continuous time series data and achieves state-of-the-art classification performance through text generation. Extensive experiments on benchmark datasets demonstrate that HiTime significantly enhances time series classification accuracy compared to most competitive baseline methods. Our findings highlight the potential of integrating temporal features into LLMs, paving the way for advanced time series analysis. The code is publicly available for further research and validation.
Machine Learning
What problem does this paper attempt to address?
This paper targets two main challenges that existing methods face when classifying multivariate time-series data:

1. **Neglect of dynamic temporal information**: Existing time-series classification methods based on large language models (LLMs) often overlook the rich dynamic temporal information inherent in time-series data. These models usually rely on discrete text tokens and cannot fully capture the complex dynamic features of a time series.
2. **Difficulty of semantic alignment between modalities**: Aligning time-series data with text representations is hard, and a poor alignment can leave the model unable to capture the temporal dependencies that accurate classification depends on. This misalignment degrades performance because the model does not fully exploit the dynamic features of the time series.

To address these problems, the authors propose HiTime, a hierarchical multimodal model that improves time-series classification in the following ways:

- **Hierarchical feature encoding**: A hierarchical feature encoder extracts multi-level feature representations from time-series data, including data-specific and task-specific embeddings. This lets the model retain the key dynamic characteristics of the time series while adapting to the specific classification task.
- **Dual-view contrastive alignment module**: A dual-view contrastive alignment module bridges the semantic gap between time-series data and text. By aligning time-series and text embeddings in a shared latent space, it improves the model's understanding and generation ability.
- **Hybrid prompting strategy**: A hybrid prompting strategy fine-tunes the pre-trained LLM in a parameter-efficient manner, enabling it to handle continuous time-series data and produce classification outputs through text generation.

Through these innovations, HiTime integrates dynamic temporal features and aligns the time-series and text modalities semantically, thereby significantly improving time-series classification accuracy.

### Formula summary

- **Embedding concatenation for hierarchical feature encoding**:
  \[ Z = \text{Concat}\left(\text{Encoder}_d(X), \text{Encoder}_s(X)\right) \]
  where \(X\) is the input instance, \(\text{Encoder}_d(\cdot)\) and \(\text{Encoder}_s(\cdot)\) are the data-specific and task-specific encoders respectively, \(\text{Concat}(\cdot)\) is the concatenation operation, and \(Z\) is the encoder output after concatenation.
- **Fine-grained alignment loss**:
  \[ L_{\text{fine}} = -\frac{1}{|D|}\left(\sum_{(e_c, e_t)\in D^+}\log\hat{y} + \sum_{(e_c, e_t)\in D^-}\log(1 - \hat{y})\right) \]
  where \(\hat{y} = F_c(e_c \oplus e_t)\), \(\oplus\) denotes concatenation, and \(F_c(\cdot)\) is a learnable mapping function that projects the concatenated vector to a scalar (1×1) probability.
- **Coarse-grained alignment loss**:
  \[ L_{\text{coarse}} = -\frac{1}{|D|}\left(\sum_{(e_c, e_t)\in D^+}\log F(e_c, e_t) + \sum_{(e_c, e_t)\in D^-}\log\left(1 - F(e_c, e_t)\right)\right) \]
  where \(F(e_c, e_t) = \text{Sigmoid}(e_c e_t^T)\).
- **Total loss function**:
  \[ L = \alpha L_{\text{coarse}} + \beta L_{\text{fine}} \]
  where \(\alpha\) and \(\beta\)