EDSNet: Efficient-DSNet for Video Summarization

Ashish Prasad, Pranav Jeevan, Amit Sethi
2024-09-23
Abstract: Current video summarization methods largely rely on transformer-based architectures, which, due to their quadratic complexity, require substantial computational resources. In this work, we address these inefficiencies by enhancing the Detect-to-Summarize Network (DSNet) with more resource-efficient token mixing mechanisms. We show that replacing traditional attention with alternatives such as Fourier transforms, wavelet transforms, and Nyströmformer improves efficiency and performance. Furthermore, we explore various pooling strategies within the Region Proposal Network, including ROI pooling, Fast Fourier Transform pooling, and flat pooling. Our experimental results on the TVSum and SumMe datasets demonstrate that these modifications significantly reduce computational costs while maintaining competitive summarization performance. Thus, our work offers a more scalable solution for video summarization tasks.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the high computational cost that existing video summarization methods incur on long video sequences. Most current methods rely on Transformer-based architectures, whose self-attention mechanism has time complexity \(O(n^2)\) in the sequence length \(n\). These architectures therefore demand substantial computational resources and become impractical for large-scale data such as social media videos, surveillance footage, and content on streaming platforms.

To meet this challenge, the paper proposes an improved Detect-to-Summarize Network (DSNet). It replaces the traditional self-attention mechanism with more efficient token mixing mechanisms (Fourier transform, wavelet transform, and Nyströmformer), which reduces computational complexity and, according to the authors, also improves performance. The paper also explores different pooling strategies in the region proposal network (ROI pooling, fast Fourier transform pooling, and flat pooling) to further improve the model's efficiency and effectiveness.

The experimental results show that these changes greatly reduce computational cost while maintaining competitive summarization performance on the TVSum and SumMe datasets. The research therefore provides a more scalable video summarization solution.

### Key point summary:

1. **Problem background**: Existing Transformer-based video summarization methods consume too many computational resources when handling long videos.
2. **Solution**: Improve DSNet by introducing more efficient token mixing mechanisms and different pooling strategies.
3. **Experimental results**: On the TVSum and SumMe datasets, the improved model significantly reduces computational cost while maintaining good summarization performance.
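
To make the token-mixing substitution above more concrete, here is a minimal NumPy sketch that contrasts a parameter-free softmax self-attention step with FNet-style Fourier mixing. It is an illustration under stated assumptions, not the paper's implementation: the summary does not specify the exact formulation EDSNet uses, and the function names, the input shape (128 frames with 64 features each), and the random features are all hypothetical. The attention step builds an \(n \times n\) score matrix, so its cost grows as \(O(n^2 d)\) for \(n\) frames of dimension \(d\), whereas the FFT along the frame axis costs \(O(n d \log n)\).

```python
import numpy as np


def self_attention_mix(x: np.ndarray) -> np.ndarray:
    """Parameter-free softmax self-attention over frames (illustrative only).

    x has shape (n_frames, n_features). The score matrix is n_frames x n_frames,
    which is the source of the quadratic cost in sequence length.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


def fourier_mix(x: np.ndarray) -> np.ndarray:
    """FNet-style token mixing (assumed formulation).

    Applies a 2D discrete Fourier transform over the frame and feature axes
    and keeps the real part. No learned parameters and no n x n matrix.
    """
    return np.fft.fft2(x, axes=(-2, -1)).real


# Hypothetical input: 128 frames, 64-dimensional features per frame.
rng = np.random.default_rng(0)
frames = rng.standard_normal((128, 64))

attended = self_attention_mix(frames)
mixed = fourier_mix(frames)
```

Both functions return an array with the same shape as the input, so in a sketch like this the Fourier step could stand in wherever the attention step is used. Whether summarization quality holds up after the swap is an empirical question, which is what the paper's TVSum and SumMe experiments evaluate.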