Unfolding Framework with Prior of Convolution-Transformer Mixture and Uncertainty Estimation for Video Snapshot Compressive Imaging

Siming Zheng, Xin Yuan
2023-06-20
Abstract: We consider the problem of video snapshot compressive imaging (SCI), where sequential high-speed frames are modulated by different masks and captured in a single measurement. Reconstructing multiple frames from only one measurement requires solving an ill-posed problem. By combining optimization algorithms and neural networks, deep unfolding networks (DUNs) have achieved great success in solving inverse problems. In this paper, our proposed model follows the DUN framework, and we propose a 3D Convolution-Transformer Mixture (CTM) module with a 3D efficient and scalable attention model plugged in, which helps fully learn the correlation between temporal and spatial dimensions by virtue of the Transformer. To the best of our knowledge, this is the first time that a Transformer has been employed for video SCI reconstruction. Besides, to further investigate the high-frequency information in the reconstruction process, which has been neglected in previous studies, we introduce variance estimation that characterizes the uncertainty on a pixel-by-pixel basis. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) results, with a 1.2 dB gain in PSNR over the previous SOTA algorithm. We will release the code.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the reconstruction problem in video snapshot compressive imaging (SCI). A video SCI system modulates consecutive high-speed frames with different masks and captures them in a single measurement. Reconstructing multiple frames from that one measurement is an ill-posed problem, because the number of pixels to be reconstructed is much larger than the number of measured values. The paper proposes a method within the deep unfolding network (DUN) framework, which combines the strengths of optimization algorithms and neural networks. In particular, the authors propose a 3D Convolution-Transformer Mixture (CTM) module and an uncertainty estimation method to better capture spatio-temporal correlations and high-frequency information, thereby improving the reconstruction quality.

### Main Contributions

1. **Proposing the 3D Convolution-Transformer Mixture (CTM)**: This module captures both local and global spatio-temporal interactions and consists of 3D convolution, 3D scalable blocked-dense attention, and 3D sparse attention.
2. **Introducing Uncertainty Estimation**: Uncertainty estimation is introduced as a regularization prior in video SCI for the first time. It focuses the model on regions with high uncertainty and improves the reconstruction fidelity of high-frequency details.
3. **Applying Transformer for the First Time**: A Transformer is applied to the video SCI reconstruction task for the first time. Experimental results show that the method improves PSNR by 1.2 dB over the previous state-of-the-art method.

### Problems Solved

- **Lack of High-Frequency Information**: Previous deep-learning methods mainly focused on low-frequency information, such as backgrounds and static regions, while ignoring high-frequency features such as edges and textures.
By introducing uncertainty estimation, the paper captures and exploits this high-frequency information, improving the reconstruction quality.
- **Insufficient Ability to Capture Global Features**: Convolutional neural networks (CNNs) are limited in capturing global features, whereas Transformers capture long-range dependencies and global correlations well. By combining CNNs and Transformers, the paper makes up for this deficiency.

### Method Overview

- **Forward Model**: The forward model of video SCI describes how multiple high-speed frames are modulated by different masks and then captured in a single measurement:
  \[ Y = \sum_{t=1}^{T} X_t \odot M_t + N \]
  where \(Y\) is the 2D measurement, \(X_t\) is the \(t\)-th frame, \(M_t\) is the \(t\)-th mask, \(N\) is the measurement noise, and \(\odot\) denotes element-wise multiplication.
- **DUN Framework**: SCI reconstruction can be posed as an optimization problem:
  \[ \hat{x} = \arg\min_x \|y - \Phi x\|_2^2 + \lambda \psi(x) \]
  where \(y\) and \(x\) are the vectorized measurement and frames, \(\Phi\) is the sensing matrix formed from the masks, \(\psi(x)\) is a regularization term, and \(\lambda\) is a balancing parameter. The paper uses the Generalized Alternating Projection (GAP) framework to unfold the iterative process.
- **Uncertainty Estimation**: Two decoding branches learn the target estimate (mean) and the uncertainty (variance), respectively, and training uses the following loss function:
  \[ L_U = \exp(-\beta)\,\|x - f(y)\|_2^2 + \beta \]
  where \(f(y)\) is the predicted mean and \(\beta = \ln\sigma^2\) is the log-variance; learning \(\beta\) rather than \(\sigma^2\) directly stabilizes training.
- **3D Convolution-Transformer Mixture (CTM)**: The CTM module consists of 3D Blocked-Dense Attention (BDA), 3D Sparse Attention (DSA), and 3D Convolution Feature Fusion (FF), and efficiently captures local and global spatio-temporal correlations.

### Experimental Results

- **Benchmark Tests**: Experiments were carried out on the DAVIS 2017 dataset, and the results...
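
The forward model, the data-fidelity step of GAP, and the uncertainty loss from the Method Overview can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the random binary masks, the array sizes, and the choice to average the loss over pixels are all assumptions made for illustration, and the learned CTM prior that follows each projection in the unfolded network is not shown.

```python
import numpy as np

def forward(frames, masks, noise=None):
    """Y = sum_t X_t * M_t + N; frames and masks have shape (T, H, W)."""
    y = np.sum(frames * masks, axis=0)
    return y if noise is None else y + noise

def gap_projection(v, y, masks):
    """Euclidean projection of the estimate v onto {x : Phi x = y}."""
    phi_sum = np.sum(masks ** 2, axis=0)            # diagonal of Phi Phi^T
    phi_sum = np.where(phi_sum == 0, 1.0, phi_sum)  # guard pixels where every mask is 0
    residual = (y - forward(v, masks)) / phi_sum
    return v + masks * residual                     # (H, W) residual broadcasts over T

def uncertainty_loss(target, mean, log_var):
    """Pixel-averaged exp(-beta) * (x - f(y))^2 + beta, with beta = ln sigma^2."""
    return np.mean(np.exp(-log_var) * (target - mean) ** 2 + log_var)

# Hypothetical toy data: 8 frames of 16x16 pixels with random binary masks.
rng = np.random.default_rng(0)
T, H, W = 8, 16, 16
frames = rng.random((T, H, W))
masks = rng.integers(0, 2, size=(T, H, W)).astype(float)
y = forward(frames, masks)                          # noise-free measurement

v = np.zeros_like(frames)                           # initial estimate
x = gap_projection(v, y, masks)                     # now consistent with y
loss = uncertainty_loss(frames, x, np.zeros_like(frames))
```

Because the masks act element-wise, \(\Phi\Phi^\top\) is diagonal, so the projection needs only per-pixel division instead of a matrix inverse. With \(\beta = 0\) everywhere the loss reduces to the mean squared error; a larger \(\beta\) at a pixel down-weights its squared error while adding a penalty of \(\beta\).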