Towards Gradient-based Time-Series Explanations through a SpatioTemporal Attention Network

Min Hun Lee
2024-05-18
Abstract: In this paper, we explore the feasibility of using a transformer-based, spatiotemporal attention network (STAN) for gradient-based time-series explanations. First, we trained the STAN model for video classifications using the global and local views of data and weakly supervised labels on time-series data (i.e. the type of an activity). We then leveraged a gradient-based XAI technique (e.g. saliency map) to identify salient frames of time-series data. According to the experiments using the datasets of four medically relevant activities, the STAN model demonstrated its potential to identify important frames of videos.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to use the Transformer-based Spatio-Temporal Attention Network (STAN) for gradient-based time-series interpretation to identify important frames in videos. Specifically, the paper aims to:

1. **Improve the interpretability of time-series data**: Introduce a new method to explain the behavior of complex AI models when processing time-series data (such as videos), especially to identify video frames that are crucial for activity classification.
2. **Verify the effectiveness of the STAN model**: Evaluate the performance of the STAN model in video classification tasks and explore its performance under different views (global view, local view, and global + local view).
3. **Explore the application of gradient-based explanation techniques**: Study the effects of three gradient-based XAI (Explainable AI) techniques (Vanilla Gradient, SmoothGrad, and GradCAM) in time-series interpretation and compare their performance on short-sequence and long-sequence videos.

### Specific problem description

- **Video classification task**: The paper first trains a STAN model for video classification, using weakly supervised labels (i.e., activity types) together with global and local views of the data. The purpose is to verify the performance of the STAN model in video classification tasks.
- **Time-series interpretation**: The paper then uses gradient-based XAI techniques (such as saliency maps) to calculate the gradient of the loss function with respect to the input data, thereby identifying important frames in the video. This provides an explanation of the model's decision-making and helps users understand why the model makes a particular prediction.

### Main contributions

- **Propose a new framework**: Apply the Transformer-based Spatio-Temporal Attention Network to time-series interpretation, especially in medically relevant activities.
- **Verify the effectiveness of multi-view**: Research shows that combining global and local views can significantly improve the performance of video classification and time-series interpretation.
- **Explore the effect of gradient-based explanation techniques**: Experimental results show that for shorter video sequences, the STAN model combined with gradient-based explanation techniques can achieve better interpretation results; for longer video sequences, traditional CNN models may be more effective.

### Research background

In recent years, with the wide application of AI in decision-making tasks, especially in high-risk fields (such as medicine), it has become crucial to explain the decisions of AI models. Although existing XAI techniques have made progress on image and tabular data, there are still challenges in the interpretation of time-series data. For this reason, this paper proposes a Transformer-based Spatio-Temporal Attention Network (STAN) and uses gradient-based explanation techniques to identify important frames in videos to improve the interpretability of time-series data.

### Summary

The main objective of this paper is to explore and verify the application of the Transformer-based Spatio-Temporal Attention Network (STAN) in time-series interpretation, especially to identify important frames in videos through gradient-based explanation techniques. The research results show that the STAN model performs well in video classification and time-series interpretation tasks when combining global and local views, but still has certain limitations when dealing with long-sequence videos.
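
The gradient-based frame scoring described above (take the gradient of the loss with respect to the input, then aggregate its magnitude per frame) can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the model is a toy classifier that mean-pools `tanh` frame features rather than the paper's STAN, the gradient is written out analytically in NumPy instead of using an autodiff framework, and the function names, `sigma`, and `n_samples` are hypothetical. It covers Vanilla Gradient and SmoothGrad only; GradCAM needs intermediate activations and is not shown.

```python
import numpy as np


def cross_entropy_loss(x, w1, w2, label):
    """Loss of a toy classifier (assumed, not STAN).

    x: (T, D) per-frame features, w1: (D, H), w2: (H, C), label: class index.
    Model: logits = mean over frames of tanh(x_t @ w1), then @ w2.
    """
    logits = np.tanh(x @ w1).mean(axis=0) @ w2
    shifted = logits - logits.max()          # stabilise the softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]


def input_gradient(x, w1, w2, label):
    """Analytic gradient of the loss with respect to the input x, shape (T, D)."""
    h = np.tanh(x @ w1)                      # (T, H) per-frame hidden features
    logits = h.mean(axis=0) @ w2             # (C,)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    d_logits = p.copy()
    d_logits[label] -= 1.0                   # softmax cross-entropy: p - onehot
    d_h = (d_logits @ w2.T)[None, :] / x.shape[0]   # mean-pool spreads 1/T to each frame
    return (d_h * (1.0 - h ** 2)) @ w1.T     # chain rule through tanh and w1


def vanilla_frame_saliency(x, w1, w2, label):
    """Vanilla Gradient: sum of |dL/dx| over the feature axis, one score per frame."""
    return np.abs(input_gradient(x, w1, w2, label)).sum(axis=1)


def smoothgrad_frame_saliency(x, w1, w2, label, n_samples=25, sigma=0.1, seed=0):
    """SmoothGrad: average the input gradient over Gaussian-perturbed copies of x."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        noisy = x + rng.normal(scale=sigma, size=x.shape)
        total += input_gradient(noisy, w1, w2, label)
    return np.abs(total / n_samples).sum(axis=1)


# Usage on random data: rank frames by saliency and keep the two highest.
rng = np.random.default_rng(42)
frames = rng.normal(size=(8, 16))            # 8 frames, 16 features each
w_hidden = rng.normal(size=(16, 12))
w_out = rng.normal(size=(12, 4))             # 4 activity classes
scores = vanilla_frame_saliency(frames, w_hidden, w_out, label=2)
top_frames = np.argsort(scores)[::-1][:2]
print("most salient frames:", top_frames)
```

With a real network, the analytic `input_gradient` would be replaced by the framework's automatic differentiation; the per-frame aggregation and the noise averaging stay the same.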