Multi-Modality Spatio-Temporal Forecasting via Self-Supervised Learning

Jiewen Deng, Renhe Jiang, Jiaqi Zhang, Xuan Song
2024-05-06
Abstract: Multi-modality spatio-temporal (MoST) data extends spatio-temporal (ST) data by incorporating multiple modalities, and is prevalent in monitoring systems that track, for example, diverse traffic demands and air quality. Despite significant strides in ST modeling in recent years, the information carried by different modalities remains underexploited. Robust MoST forecasting is more challenging because the data exhibits (i) high-dimensional and complex internal structures and (ii) dynamic heterogeneity caused by temporal, spatial, and modality variations. In this study, we propose a novel MoST learning framework via Self-Supervised Learning, namely MoSSL, which aims to uncover latent patterns from temporal, spatial, and modality perspectives while quantifying dynamic heterogeneity. Experimental results on two real-world MoST datasets verify the superiority of our approach compared with state-of-the-art baselines. Model implementation is available at
Machine Learning
What problem does this paper attempt to address?
This paper addresses forecasting on multi-modality spatio-temporal (MoST) data, which is common in real-world monitoring systems such as those tracking different transportation demands or air quality. Compared with ordinary spatio-temporal data, MoST data adds a modality dimension. This makes prediction harder for two reasons: the data is high-dimensional with a complex internal structure, and it exhibits dynamic heterogeneity, meaning its patterns vary across time, space, and modality.

To handle this, the paper proposes MoSSL, a MoST learning framework based on self-supervised learning that explores latent patterns and quantifies dynamic heterogeneity from the temporal, spatial, and modality perspectives. MoSSL consists of four main parts: (1) a MoST encoder that captures spatial, temporal, and modality information; (2) multi-modality data augmentation, which helps the model learn pattern correlations and integrates MoST domain information; (3) Global Self-Supervised Learning (GSSL), which identifies diverse pattern changes from the different perspectives; and (4) Modality Self-Supervised Learning (MSSL), which further strengthens the learned representations of inter-modality and intra-modality features.

Experiments on two real-world MoST datasets show that MoSSL outperforms existing state-of-the-art baselines on traffic and air quality forecasting tasks. The paper also reports ablation studies that measure the contribution of each key component, and case studies that illustrate the effects of modality augmentation and heterogeneity decoupling. In summary, the paper uses self-supervised learning to capture and quantify heterogeneity along the temporal, spatial, and modality dimensions, with the goal of improving the accuracy of MoST forecasting.
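To make the data layout concrete, the following is a minimal sketch of how MoST data might be arranged for a forecasting task. It is not the authors' implementation. The tensor shape (time steps, nodes, modalities), the sizes, the window lengths, and the helper name `make_windows` are all assumptions chosen for illustration, and the random values merely stand in for real sensor readings.

```python
import numpy as np


def make_windows(data, in_len, out_len):
    """Slice a tensor of shape (T, N, M) into (input, target) forecasting pairs.

    T = time steps, N = spatial nodes (e.g. sensors), M = modalities.
    This layout is an assumption for illustration, not taken from the paper.
    """
    n_samples = data.shape[0] - in_len - out_len + 1
    if n_samples <= 0:
        raise ValueError("series too short for the requested window lengths")
    # Each sample pairs in_len past steps with the out_len steps that follow.
    x = np.stack([data[i:i + in_len] for i in range(n_samples)])
    y = np.stack([data[i + in_len:i + in_len + out_len] for i in range(n_samples)])
    return x, y


# Hypothetical sizes: 48 time steps, 5 nodes, 3 modalities.
rng = np.random.default_rng(0)
most = rng.random((48, 5, 3))

x, y = make_windows(most, in_len=12, out_len=4)
print(x.shape, y.shape)  # (33, 12, 5, 3) (33, 4, 5, 3)

# Per-modality statistics over time and space: one simple way to see that
# modalities can follow different distributions.
modality_mean = most.mean(axis=(0, 1))
modality_std = most.std(axis=(0, 1))
print(modality_mean.shape, modality_std.shape)  # (3,) (3,)
```

A model for this setting would map each input window of shape (12, 5, 3) to a target of shape (4, 5, 3), predicting every node and modality jointly. The per-modality statistics are only a crude view of modality variation and should not be read as the paper's way of quantifying heterogeneity.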