Abstract: Video snapshot compressive imaging (SCI) aims to capture a sequence of video frames with only a single shot of a 2D detector; its backbone consists of optical modulation patterns (also known as masks) and a computational reconstruction algorithm. Advanced deep learning algorithms and mature hardware are putting video SCI into practical applications. Yet, there are two clouds in the sunshine of SCI: i) low dynamic range as a victim of high temporal multiplexing, and ii) the degradation of existing deep learning algorithms on real systems. To address these challenges, this paper presents a deep optics framework to jointly optimize masks and a reconstruction network. Specifically, we first propose a new type of structural mask to realize motion-aware and full-dynamic-range measurement. Considering the motion-awareness property in the measurement domain, we develop an efficient network for video SCI reconstruction, dubbed Res2former, which uses a Transformer to capture long-term temporal dependencies. Moreover, sensor response is introduced into the forward model of video SCI to guarantee that end-to-end model training stays close to the real system. Finally, we implement the learned structural masks on a digital micro-mirror device. Experimental results on synthetic and real data validate the effectiveness of the proposed framework. We believe this is a milestone for real-world video SCI. The source code and data are available at

What problem does this paper attempt to address?

The paper attempts to address two main issues in Video Snapshot Compressive Imaging (SCI):

1. **Low Dynamic Range**: Due to high temporal multiplexing, existing video SCI systems are limited in their dynamic range when capturing video frames. Specifically, when using random binary masks, the measurable brightness values are far fewer than the available brightness values of the image sensor, resulting in each video frame being represented by a limited range of brightness values, which does not match the wide range of brightness variations in natural scenes.
2. **Performance Degradation of Existing Deep Learning Algorithms in Real Systems**: Existing deep learning reconstruction networks perform well on simulated data but show significant performance degradation in real systems. This is because the existing forward models only consider optical transmission and modulation, ignoring the sensor response, leading to a gap between the model and the real system.

To address these challenges, the paper proposes a deep optics framework aimed at jointly optimizing the masks and the reconstruction network. The specific contributions are as follows:

- **Novel Structured Masks**: Unlike the widely used random binary masks, a new structured mask is proposed that enables motion-aware and Full-Dynamic-Range (FDR) measurements. This mask not only improves the dynamic range but also retains more visual information, aiding in the reconstruction of video SCI.
- **Efficient Reconstruction Network**: Considering the motion-aware characteristics of the encoder, an efficient reconstruction network called Res2former is designed, using Transformers to capture long-term temporal dependencies. Compared to the state-of-the-art network STFormer, Res2former is more lightweight in terms of parameter count and computational complexity while achieving comparable performance.
- **End-to-End Training**: The proposed deep optics framework incorporates sensor response, ensuring end-to-end training from encoding to decoding that closely approximates the real system.

Experimental results show that this framework achieves significant improvements on both synthetic and real data. Through these innovations, the paper provides an important milestone for video SCI in practical applications.
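To make the dynamic-range and sensor-response points above more concrete, the following is a minimal NumPy sketch of a video SCI forward model. It is an illustration under stated assumptions, not the paper's implementation: the function name `sci_forward`, the equal split of exposure across the T frames, and the clip-then-quantize sensor model are all hypothetical choices made here for clarity.

```python
import numpy as np

def sci_forward(frames, masks, bit_depth=8):
    """Simulate one snapshot measurement (hypothetical helper, not from the paper).

    frames: array of shape (T, H, W) with scene intensities in [0, 1]
    masks:  array of shape (T, H, W) with modulation values in [0, 1]
    Returns the ideal analog measurement and the integer sensor codes.
    """
    frames = np.asarray(frames, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    if frames.shape != masks.shape:
        raise ValueError("frames and masks must share the same (T, H, W) shape")
    num_frames = frames.shape[0]

    # Optical part: modulate each frame by its mask and sum over one exposure.
    # Assumption: each frame receives 1/T of the exposure, so the sum stays in [0, 1].
    analog = (masks * frames).sum(axis=0) / num_frames

    # Sensor part (assumed model): saturate at full scale, then quantize
    # to integer codes in [0, 2**bit_depth - 1].
    max_code = 2 ** bit_depth - 1
    digital = np.round(np.clip(analog, 0.0, 1.0) * max_code).astype(np.int64)
    return analog, digital

# Example: 8 frames compressed into one 8-bit snapshot with random binary masks.
rng = np.random.default_rng(0)
T, H, W = 8, 16, 16
frames = rng.random((T, H, W))
masks = rng.integers(0, 2, size=(T, H, W))
analog, digital = sci_forward(frames, masks, bit_depth=8)

# Under the equal-exposure assumption, a single frame can move the
# measurement by at most this many sensor codes.
codes_per_frame = (2 ** 8 - 1) / T
```

Under these assumptions, with T = 8 and an 8-bit sensor, a single frame can shift the measurement by only about 32 of the 256 available codes, which illustrates why high temporal multiplexing limits per-frame dynamic range. A reconstruction network trained on `analog` alone never sees the saturation and quantization applied to `digital`, which is one way to picture the simulation-to-real gap described above.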

Deep Optics for Video Snapshot Compressive Imaging

Deep Motion Regularizer for Video Snapshot Compressive Imaging

Deep learning for video compressive sensing

EfficientSCI: Densely Connected Network with Space-time Factorization for Large-scale Video Snapshot Compressive Imaging

Hybrid CNN-Transformer Architecture for Efficient Large-Scale Video Snapshot Compressive Imaging

MetaSCI: Scalable and Adaptive Reconstruction for Video Compressive Sensing

A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging

Snapshot Compressive Imaging: Principle, Implementation, Theory, Algorithms and Applications

Snapshot Compressive Imaging Using Domain-Factorized Deep Video Prior

Key frames assisted hybrid encoding for photorealistic compressive video sensing

Plug-and-Play Algorithms for Video Snapshot Compressive Imaging

Recent Advances of Deep Learning for Spectral Snapshot Compressive Imaging

Provable deep video denoiser using spatial-temporal information for video snapshot compressive imaging: Algorithm and convergence analysis

Memory-Efficient Network for Large-scale Video Compressive Sensing

Deep-Learning Supervised Snapshot Compressive Imaging Enabled by an End-to-End Adaptive Neural Network

Unfolding Framework with Prior of Convolution-Transformer Mixture and Uncertainty Estimation for Video Snapshot Compressive Imaging

Plug-and-Play Algorithms for Large-scale Snapshot Compressive Imaging

Deep Equilibrium Models for Video Snapshot Compressive Imaging

Event-Enhanced Snapshot Compressive Videography at 10K FPS

GAPMSF-Net: Generalized Alternating Projection With Multi-Stage Fusion Network for Snapshot Compressive Imaging

Key Frames Assisted Hybrid Encoding for High-Quality Compressive Video Sensing