Class-Based Time Series Data Augmentation to Mitigate Extreme Class Imbalance for Solar Flare Prediction

Junzhi Wen, Rafal A. Angryk
2024-05-31
Abstract: Time series data plays a crucial role across various domains, making it valuable for decision-making and predictive modeling. Machine learning (ML) and deep learning (DL) have shown promise in this regard, yet their performance hinges on data quality and quantity, often constrained by data scarcity and class imbalance, particularly for rare events like solar flares. Data augmentation techniques offer a potential solution to address these challenges, yet their effectiveness on multivariate time series datasets remains underexplored. In this study, we propose a novel data augmentation method for time series data named Mean Gaussian Noise (MGN). We investigate the performance of MGN compared to eight existing basic data augmentation methods on a multivariate time series dataset for solar flare prediction, SWAN-SF, using an ML algorithm for time series data, TimeSeriesSVC. The results demonstrate the efficacy of MGN and highlight its potential for improving classification performance in scenarios with extremely imbalanced data. Our time complexity analysis shows that MGN also has a competitive computational cost compared to the investigated alternative methods.
Machine Learning, Instrumentation and Methods for Astrophysics, Solar and Stellar Astrophysics, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the poor performance of machine learning models in solar flare prediction caused by extreme class imbalance in multivariate time series data. Because solar flares are rare, flaring examples in the dataset are far outnumbered by examples of the other classes. Data augmentation is a natural remedy, but the effectiveness of existing augmentation methods on multivariate time series remains underexplored. The authors therefore propose a new data augmentation method, Mean Gaussian Noise (MGN), to improve classification performance on extremely imbalanced datasets. The main contributions of the paper are:

1. **Proposing the MGN method**: MGN uses the time series mean of the entire dataset to generate synthetic data, strengthening the representation of the minority class.
2. **Experimental validation**: On the SWAN-SF dataset, with TimeSeriesSVC as the classifier, MGN was compared with eight existing basic data augmentation methods, and the results demonstrated its efficacy in improving classification performance.
3. **Computational complexity analysis**: A time complexity analysis shows that MGN's computational cost is competitive with that of the other methods investigated.

Through this research, the paper aims to provide an effective method for addressing class imbalance in multivariate time series data, especially for rare events such as solar flares.
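To make the idea of mean-based Gaussian-noise augmentation concrete, here is a minimal sketch in plain Python. It is not the authors' algorithm: the summary only says that MGN uses the time series mean of the entire dataset, so the code assumes one possible reading, in which each synthetic minority sample is a copy of a real minority sample perturbed by Gaussian noise whose standard deviation is a fraction of the dataset-wide mean of each feature. The function names `feature_means` and `augment_minority`, the `noise_ratio` parameter, and the toy data are all hypothetical.

```python
import random


def feature_means(dataset):
    """Mean of each feature over all samples and all time steps.

    dataset: list of samples; a sample is a list of time steps;
    a time step is a list of feature values.
    """
    n_features = len(dataset[0][0])
    totals = [0.0] * n_features
    count = 0
    for sample in dataset:
        for step in sample:
            for f, value in enumerate(step):
                totals[f] += value
            count += 1
    return [total / count for total in totals]


def augment_minority(dataset, labels, minority_label, n_new,
                     noise_ratio=0.05, seed=0):
    """Create n_new synthetic minority samples (assumed scheme, see lead-in).

    Noise for feature f is drawn from N(0, (noise_ratio * |mean_f|)^2),
    where mean_f is computed over the entire dataset.
    """
    rng = random.Random(seed)
    sigmas = [noise_ratio * abs(m) for m in feature_means(dataset)]
    minority = [s for s, y in zip(dataset, labels) if y == minority_label]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)  # pick a real minority sample to perturb
        synthetic.append(
            [[v + rng.gauss(0.0, sigmas[f]) for f, v in enumerate(step)]
             for step in base]
        )
    return synthetic


# Toy data: 3 samples, 2 time steps, 2 features; label 1 is the minority.
dataset = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0], [4.0, 40.0]],
]
labels = [0, 0, 1]

# One synthetic minority sample is enough to balance the two classes here.
synthetic = augment_minority(dataset, labels, minority_label=1, n_new=1)
balanced_data = dataset + synthetic
balanced_labels = labels + [1] * len(synthetic)
```

In practice, augmentation of this kind would be applied only to the training partition, so that the test data still reflects the true class imbalance.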