Deep Optics for Video Snapshot Compressive Imaging

Ping Wang, Lishun Wang, Xin Yuan
DOI: https://doi.org/10.1109/ICCV51070.2023.00977
2024-04-08
Abstract: Video snapshot compressive imaging (SCI) aims to capture a sequence of video frames with only a single shot of a 2D detector, whose backbones rest in optical modulation patterns (also known as masks) and a computational reconstruction algorithm. Advanced deep learning algorithms and mature hardware are putting video SCI into practical applications. Yet, there are two clouds in the sunshine of SCI: i) low dynamic range as a victim of high temporal multiplexing, and ii) existing deep learning algorithms' degradation on real systems. To address these challenges, this paper presents a deep optics framework to jointly optimize masks and a reconstruction network. Specifically, we first propose a new type of structural mask to realize motion-aware and full-dynamic-range measurement. Considering the motion awareness property in the measurement domain, we develop an efficient network for video SCI reconstruction using Transformer to capture long-term temporal dependencies, dubbed Res2former. Moreover, sensor response is introduced into the forward model of video SCI to guarantee end-to-end model training close to the real system. Finally, we implement the learned structural masks on a digital micro-mirror device. Experimental results on synthetic and real data validate the effectiveness of the proposed framework. We believe this is a milestone for real-world video SCI. The source code and data are available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address two main issues in Video Snapshot Compressive Imaging (SCI):

1. **Low Dynamic Range**: Due to high temporal multiplexing, existing video SCI systems are limited in their dynamic range when capturing video frames. Specifically, when using random binary masks, the measurable brightness values are far fewer than the available brightness values of the image sensor, resulting in each video frame being represented by a limited range of brightness values, which does not match the wide range of brightness variations in natural scenes.
2. **Performance Degradation of Existing Deep Learning Algorithms in Real Systems**: Existing deep learning reconstruction networks perform well on simulated data but show significant performance degradation in real systems. This is because the existing forward models only consider optical transmission and modulation, ignoring the sensor response, leading to a gap between the model and the real system.

To address these challenges, the paper proposes a deep optics framework aimed at jointly optimizing the masks and the reconstruction network. The specific contributions are as follows:

- **Novel Structured Masks**: Unlike the widely used random binary masks, a new structured mask is proposed that enables motion-aware and Full-Dynamic-Range (FDR) measurements. This mask not only improves the dynamic range but also retains more visual information, aiding in the reconstruction of video SCI.
- **Efficient Reconstruction Network**: Considering the motion-aware characteristics of the encoder, an efficient reconstruction network called Res2former is designed, using Transformers to capture long-term temporal dependencies. Compared to the state-of-the-art network STFormer, Res2former is more lightweight in terms of parameter count and computational complexity while achieving comparable performance.
- **End-to-End Training**: The proposed deep optics framework incorporates sensor response, ensuring end-to-end training from encoding to decoding that closely approximates the real system. Experimental results show that this framework achieves significant improvements on both synthetic and real data.

Through these innovations, the paper provides an important milestone for video SCI in practical applications.
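The two issues described in this summary, the dynamic-range cost of temporal multiplexing and the omission of sensor response from the forward model, can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `sci_forward` and `gray_levels_per_frame`, the `gain` and `bits` parameters, and the clip-then-quantize sensor model are all assumptions chosen for illustration, and the masks here are the random binary baseline rather than the learned structural masks.

```python
import numpy as np

def sci_forward(frames, masks, gain=None, bits=8):
    """Simulate one snapshot measurement from T frames (illustrative only).

    frames, masks: arrays of shape (T, H, W) with values in [0, 1].
    gain: hypothetical exposure scale; defaults to 1/T so that T fully
          open mask pixels on a white scene just reach sensor full scale.
    bits: assumed bit depth of the sensor's analog-to-digital converter.
    """
    frames = np.asarray(frames, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    T = frames.shape[0]
    if gain is None:
        gain = 1.0 / T
    # Optical part: per-frame modulation, then integration over the exposure.
    irradiance = gain * (masks * frames).sum(axis=0)
    # Sensor part (simplified assumption): saturation, then quantization.
    saturated = np.clip(irradiance, 0.0, 1.0)
    levels = 2 ** bits - 1
    return np.round(saturated * levels) / levels

def gray_levels_per_frame(T, bits=8):
    """Rough count of sensor codes left for a single frame when the
    exposure is scaled by 1/T to avoid saturation."""
    return (2 ** bits - 1) // T + 1

rng = np.random.default_rng(0)
T, H, W = 8, 16, 16
frames = rng.random((T, H, W))                             # synthetic video clip
masks = (rng.random((T, H, W)) < 0.5).astype(np.float64)   # random binary masks
y = sci_forward(frames, masks)                             # single 2D measurement
print(y.shape, gray_levels_per_frame(T))
```

Under these assumptions, the number of sensor codes available to any one frame shrinks roughly in proportion to the number of multiplexed frames `T`, which is the dynamic-range limitation the summary attributes to random binary masks. A reconstruction network trained on the unquantized, unclipped `irradiance` alone would not see the saturation and rounding that a real sensor applies, which is one plausible reading of the simulation-to-real gap described above.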