DemMamba: Alignment-free Raw Video Demoireing with Frequency-assisted Spatio-Temporal Mamba

Shuning Xu, Xina Liu, Binbin Song, Xiangyu Chen, Qiubo Chen, Jiantao Zhou
2024-08-20
Abstract: Moire patterns arise when two similar repetitive patterns interfere, a phenomenon frequently observed during the capture of images or videos on screens. The color, shape, and location of moire patterns may differ across video frames, posing a challenge in learning information from adjacent frames and preserving temporal consistency. Previous video demoireing methods heavily rely on well-designed alignment modules, resulting in substantial computational burdens. Recently, Mamba, an improved version of the State Space Model (SSM), has demonstrated significant potential for modeling long-range dependencies with linear complexity, enabling efficient temporal modeling in video demoireing without requiring a specific alignment module. In this paper, we propose a novel alignment-free Raw video demoireing network with frequency-assisted spatio-temporal Mamba (DemMamba). The Spatial Mamba Block (SMB) and Temporal Mamba Block (TMB) are sequentially arranged to facilitate effective intra- and inter-relationship modeling in Raw videos with moire patterns. Within SMB, an Adaptive Frequency Block (AFB) is introduced to aid demoireing in the frequency domain. For TMB, a Channel Attention Block (CAB) is embedded to further enhance temporal information interactions by exploiting the inter-channel relationships among features. Extensive experiments demonstrate that our proposed DemMamba surpasses state-of-the-art approaches by 1.3 dB and delivers a superior visual experience.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses video demoiréing, specifically the removal of moiré patterns from Raw videos. Moiré arises from interference between two similar repetitive patterns and is common when images or videos are captured from a screen. Because the color, shape, and position of moiré patterns can change from frame to frame, it is difficult to learn from adjacent frames and to maintain temporal consistency. Existing video demoiréing methods usually rely on carefully designed alignment modules, which impose a significant computational burden, especially on high-resolution, long video sequences.

To address these problems, the paper proposes DemMamba, an alignment-free Raw video demoiréing network built on a frequency-assisted spatio-temporal Mamba model. Its main components are:

1. **Spatial Mamba Block (SMB) and Temporal Mamba Block (TMB)**: The two blocks are arranged sequentially to model intra-frame and inter-frame relationships in Raw videos with moiré patterns.
2. **Adaptive Frequency Block (AFB)**: An AFB inside the SMB aids demoiréing in the frequency domain.
3. **Channel Attention Block (CAB)**: A CAB embedded in the TMB further enhances temporal information interaction by exploiting the inter-channel relationships among features.

With these designs, DemMamba removes moiré while maintaining the temporal consistency of the video. The paper reports that it is more efficient than alignment-based methods, surpasses state-of-the-art approaches by 1.3 dB, and gives better visual results.
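The summary does not say how the SMB and TMB turn video features into sequences, so the following is only an illustrative sketch under stated assumptions: a row-major scan within each frame for the spatial (intra-frame) pass, and a per-location scan across frames for the temporal (inter-frame) pass. The function names `spatial_sequences` and `temporal_sequences` are hypothetical and do not come from the paper. The point is that neither ordering needs an explicit alignment step between frames.

```python
# Illustrative only: the scan orders below are assumptions, not the paper's design.

def spatial_sequences(video):
    """Return one sequence per frame, with pixels in row-major order.

    `video` is a nested list indexed as video[t][row][col].
    A spatial (intra-frame) sequence model would consume each of these.
    """
    return [[px for row in frame for px in row] for frame in video]


def temporal_sequences(video):
    """Return one sequence per pixel location, running across frames.

    A temporal (inter-frame) sequence model would consume each of these.
    """
    height, width = len(video[0]), len(video[0][0])
    return [
        [frame[r][c] for frame in video]
        for r in range(height)
        for c in range(width)
    ]


# Toy video: T=2 frames, each H=2 by W=2, with scalar "features".
video = [
    [[0, 1], [2, 3]],
    [[4, 5], [6, 7]],
]
print(spatial_sequences(video))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(temporal_sequences(video))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

In this toy layout, the spatial pass sees T sequences of length H·W, while the temporal pass sees H·W sequences of length T.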
### Formula Representation

The formulas in the paper mainly concern the discretization of the State Space Model (SSM). A continuous-time linear time-invariant (LTI) system can be described by the linear ordinary differential equation (ODE):

\[ h'(t) = Ah(t) + Bx(t), \]
\[ y(t) = Ch(t) + Dx(t), \]

where \( N \) is the state size, \( A\in\mathbb{R}^{N\times N} \), \( B\in\mathbb{R}^{N\times1} \), \( C\in\mathbb{R}^{1\times N} \), and \( D\in\mathbb{R} \).

Discretization with step size \( \Delta \) follows the zero-order-hold (ZOH) rule:

\[ \overline{A} = \exp(\Delta A), \]
\[ \overline{B} = (\Delta A)^{-1}(\exp(\Delta A)-I)\cdot\Delta B. \]

The discretized system can be written in recurrent neural network (RNN) form:

\[ h_k = \overline{A}h_{k-1}+\overline{B}x_k, \]
\[ y_k = Ch_k+Dx_k. \]

Equivalently, it can be written in convolutional neural network (CNN) form:

\[ \overline{K}\triangleq(C\overline{B},\,C\overline{A}\,\overline{B},\,\ldots,\,C\overline{A}^{L-1}\overline{B}), \]
\[ y = x\circledast\overline{K}, \]

where \( L \) is the length of the input sequence, \( \circledast \) denotes convolution, and \( \overline{K}\in\mathbb{R}^L \) is the structured convolution kernel.

### Summary

This paper aims to develop an efficient, alignment-free Raw video demoiréing method. By introducing a frequency-assisted spatio-temporal Mamba model, it addresses the high computational cost and poor temporal consistency of existing methods. The experimental results show that DemMamba outperforms existing methods in both quantitative and qualitative evaluations.
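To make the formula section concrete, here is a minimal sketch of ZOH discretization and of the equivalence between the recurrent and convolutional forms. It assumes a diagonal state matrix with nonzero entries, so the matrix exponential and inverse reduce to elementwise operations. That simplification is an assumption of this sketch, not something the summary states about DemMamba. The names `a_bar` and `b_bar` stand for the discretized parameters, and the skip term `d * x` is added separately in the convolutional path so both forms produce the same output.

```python
import math

# Assumption: A = diag(a_diag) with nonzero entries, so exp and inverse are elementwise.

def zoh_discretize(a_diag, b, delta):
    """Zero-order-hold discretization of a diagonal continuous-time SSM."""
    a_bar = [math.exp(delta * a) for a in a_diag]
    # (delta*a)^-1 * (exp(delta*a) - 1) * (delta*b), one state dimension at a time
    b_bar = [
        (ab - 1.0) / (delta * a) * (delta * bi)
        for a, ab, bi in zip(a_diag, a_bar, b)
    ]
    return a_bar, b_bar


def ssm_recurrent(a_bar, b_bar, c, d, xs):
    """Recurrent form: h_k = a_bar*h_{k-1} + b_bar*x_k, y_k = c.h_k + d*x_k."""
    h = [0.0] * len(a_bar)  # zero initial state
    ys = []
    for x in xs:
        h = [ab * hi + bb * x for ab, bb, hi in zip(a_bar, b_bar, h)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)) + d * x)
    return ys


def ssm_kernel(a_bar, b_bar, c, length):
    """Kernel entries K_j = sum_i c_i * a_bar_i**j * b_bar_i for j = 0..length-1."""
    return [
        sum(ci * ab ** j * bb for ci, ab, bb in zip(c, a_bar, b_bar))
        for j in range(length)
    ]


def ssm_convolutional(kernel, d, xs):
    """Causal convolution of the input with the kernel, plus the d*x skip term."""
    return [
        sum(kernel[j] * xs[k - j] for j in range(k + 1)) + d * xs[k]
        for k in range(len(xs))
    ]


# Arbitrary toy parameters with state size N = 3.
a_diag, b, c, d, delta = [-1.0, -2.0, -0.5], [1.0, 0.5, 2.0], [0.3, -0.7, 1.1], 0.1, 0.1
xs = [1.0, 0.0, -2.0, 3.5, 0.25]

a_bar, b_bar = zoh_discretize(a_diag, b, delta)
y_rec = ssm_recurrent(a_bar, b_bar, c, d, xs)
y_conv = ssm_convolutional(ssm_kernel(a_bar, b_bar, c, len(xs)), d, xs)
print(all(abs(p - q) < 1e-9 for p, q in zip(y_rec, y_conv)))  # True
```

The recurrent form costs one state update per step, which is where the linear complexity in sequence length comes from. The convolutional form computes the same outputs from a precomputed kernel of length \( L \).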