LiPCoT: Linear Predictive Coding based Tokenizer for Self-supervised Learning of Time Series Data via Language Models

Md Fahim Anjum
2024-08-14
Abstract: Language models have achieved remarkable success in various natural language processing tasks. However, their application to time series data, a crucial component in many domains, remains limited. This paper proposes LiPCoT (Linear Predictive Coding based Tokenizer for time series), a novel tokenizer that encodes time series data into a sequence of tokens, enabling self-supervised learning of time series using existing language model architectures such as BERT. Unlike traditional time series tokenizers that rely heavily on a CNN encoder for time series feature generation, LiPCoT employs stochastic modeling through linear predictive coding to create a latent space for time series, providing a compact yet rich representation of the inherent stochastic nature of the data. Furthermore, LiPCoT is computationally efficient and can effectively handle time series data with varying sampling rates and lengths, overcoming common limitations of existing time series tokenizers. In this proof-of-concept work, we demonstrate the effectiveness of LiPCoT in classifying Parkinson's disease (PD) using an EEG dataset from 46 participants. In particular, we utilize LiPCoT to encode EEG data into a small vocabulary of tokens and then use BERT for self-supervised learning and the downstream task of PD classification. We benchmark our approach against several state-of-the-art CNN-based deep learning architectures for PD detection. Our results reveal that BERT models utilizing self-supervised learning outperformed the best-performing existing method by 7.1% in precision, 2.3% in recall, 5.5% in accuracy, 4% in AUC, and 5% in F1-score, highlighting the potential for self-supervised learning even on small datasets. Our work will inform future foundational models for time series, particularly for self-supervised learning.
Machine Learning, Artificial Intelligence, Signal Processing
What problem does this paper attempt to address?
The paper aims to address key issues in time series data analysis, particularly how to effectively utilize deep learning techniques, such as language models, for self-supervised learning of time series data. Specifically, the paper proposes a new method called LiPCoT (Linear Predictive Coding based Tokenizer) to convert time series data into a series of discrete "tokens" or symbols, thereby enabling existing language model architectures (e.g., BERT) to be applied to time series data. The main contributions of LiPCoT include:

1. **Addressing the limitations of traditional methods**: Traditional convolutional neural network (CNN)-based time series analysis methods often require substantial computational resources and are inadequate in handling long-range dependencies. Moreover, these methods are not directly applicable to self-supervised learning.
2. **Innovative tokenization method**: LiPCoT creates latent representations of time series data through Linear Predictive Coding (LPC), a method that captures the inherent stochastic nature of time series data and is independent of specific sampling rates or data lengths.
3. **Efficiency and versatility**: LiPCoT is not only computationally efficient but also capable of handling time series data with different sampling rates and lengths.
4. **Empirical study**: The paper demonstrates the effectiveness of LiPCoT through a case study on classifying Parkinson's Disease (PD) using electroencephalogram (EEG) data. Experimental results show that using the BERT model combined with self-supervised learning methods outperforms several existing deep learning architectures in terms of precision, recall, accuracy, AUC, and F1 score.

In summary, the paper presents a novel approach to overcoming common challenges in time series data processing and demonstrates its potential in practical applications, particularly in the healthcare domain, such as the diagnosis of Parkinson's Disease.
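
To make the idea of an LPC-based tokenizer more concrete, the following is a minimal sketch and not the paper's actual pipeline. It assumes that the signal is cut into fixed-length, non-overlapping segments, that each segment is described by LPC coefficients estimated with the autocorrelation (Yule-Walker) method, and that tokens are the indices of the nearest centroids found by a plain k-means. The function names (`lpc_coefficients`, `segment_features`, `fit_vocabulary`, `tokenize`), the segmenting scheme, and the clustering step are all illustrative assumptions; the summary above does not specify how LiPCoT builds its latent space or vocabulary.

```python
import numpy as np


def lpc_coefficients(segment, order):
    """Estimate LPC coefficients with the autocorrelation (Yule-Walker) method.

    Returns `a` such that x[n] is approximated by sum_k a[k] * x[n - 1 - k].
    """
    x = np.asarray(segment, dtype=float)
    x = x - x.mean()
    # Unnormalized autocorrelation at lags 0..order.
    r = np.array([np.dot(x[: len(x) - lag], x[lag:]) for lag in range(order + 1)])
    # Build the Toeplitz matrix R[i, j] = r[|i - j|]; a tiny ridge keeps it invertible.
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[idx] + (1e-9 * r[0] + 1e-12) * np.eye(order)
    return np.linalg.solve(R, r[1:])


def segment_features(signal, segment_len, order):
    """LPC coefficients for each non-overlapping segment (assumed segmentation)."""
    n_seg = len(signal) // segment_len
    return np.array([
        lpc_coefficients(signal[i * segment_len:(i + 1) * segment_len], order)
        for i in range(n_seg)
    ])


def fit_vocabulary(features, vocab_size, n_iter=20, seed=0):
    """Plain k-means over LPC features; each centroid stands for one token (assumption)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), vocab_size, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(vocab_size):
            if np.any(labels == k):  # leave empty clusters where they are
                centroids[k] = features[labels == k].mean(axis=0)
    return centroids


def tokenize(features, centroids):
    """Map each feature vector to the index of its nearest centroid."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

In this sketch, calling `segment_features` on a recording, fitting centroids with `fit_vocabulary`, and passing both to `tokenize` yields an integer sequence that could be fed to a masked language model such as BERT in place of word-piece IDs. Because the LPC order fixes the feature dimension regardless of segment length, segments of different lengths map into the same feature space, which is one way to read the claim about handling varying lengths.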