Aggregating Nearest Sharp Features via Hybrid Transformers for Video Deblurring

Wei Shang, Dongwei Ren, Yi Yang, Wangmeng Zuo
DOI: https://doi.org/10.1016/j.ins.2024.121689
2024-11-29
Abstract: Video deblurring methods, which aim to recover consecutive sharp frames from a given blurry video, usually assume that the input video consists of consecutively blurry frames. However, in real-world videos captured by modern imaging devices, sharp frames are often interspersed within the video, providing temporally nearest sharp features that can aid the restoration of blurry frames. In this work, we propose a video deblurring method that leverages both neighboring frames and existing sharp frames, using hybrid Transformers for feature aggregation. Specifically, we first train a blur-aware detector to distinguish between sharp and blurry frames. Then, a window-based local Transformer is employed to exploit features from neighboring frames, where cross attention aggregates features from neighboring frames without explicit spatial alignment. To aggregate nearest sharp features from the detected sharp frames, we utilize a global Transformer with multi-scale matching capability. Moreover, our method can easily be extended to event-driven video deblurring by incorporating an event fusion module into the global Transformer. Extensive experiments on benchmark datasets demonstrate that the proposed method outperforms state-of-the-art video deblurring methods, as well as event-driven video deblurring methods, in terms of quantitative metrics and visual quality. The source code and trained models are available at https://github.com/shangwei5/STGTN.
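The idea of "temporally nearest sharp features" presupposes a lookup step: once each frame has been labeled sharp or blurry, every blurry frame needs to know which sharp frames are closest to it in time. The minimal sketch below shows one way to do that lookup, assuming the detector's output is simply a list of booleans. The function name `nearest_sharp_indices` and the two-sided (previous/next) search are hypothetical illustrations, not the authors' released code.

```python
from typing import List, Optional, Tuple


def nearest_sharp_indices(
    is_sharp: List[bool],
) -> List[Tuple[Optional[int], Optional[int]]]:
    """For each frame t, return (prev, next): the index of the nearest sharp
    frame at or before t, and at or after t, or None if no such frame exists.

    Hypothetical helper: `is_sharp` stands in for the blur-aware detector's
    per-frame decisions (True = sharp, False = blurry).
    """
    n = len(is_sharp)
    prev_idx: List[Optional[int]] = [None] * n
    next_idx: List[Optional[int]] = [None] * n

    # Forward pass: carry the most recent sharp index seen so far.
    last: Optional[int] = None
    for t in range(n):
        if is_sharp[t]:
            last = t
        prev_idx[t] = last

    # Backward pass: carry the closest sharp index to the right.
    last = None
    for t in range(n - 1, -1, -1):
        if is_sharp[t]:
            last = t
        next_idx[t] = last

    return list(zip(prev_idx, next_idx))


if __name__ == "__main__":
    flags = [True, False, False, True, False]
    print(nearest_sharp_indices(flags))
    # [(0, 0), (0, 3), (0, 3), (3, 3), (3, None)]
```

Two linear passes keep the lookup O(n) in the number of frames. A sharp frame maps to itself, and a blurry frame at the end of a clip may have only one sharp neighbor, which is why `None` is allowed.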
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to use the sharp frames that already exist in a video to help restore its blurry frames. Existing video deblurring methods usually assume that the frames of the input video are consecutively blurry, but in real scenarios sharp frames are often interspersed in the video. These sharp frames provide the temporally nearest sharp features, which help restore the blurry frames. The paper therefore proposes a new video deblurring method that uses hybrid Transformers to aggregate features from neighboring frames and from detected sharp frames, improving deblurring quality.

### Main contributions

1. **Hybrid Transformer framework**: A new video deblurring framework that uses hybrid Transformers to aggregate features from the detected sharp frames and from neighboring frames.
2. **Blur-aware detector**: A blur-aware detector trained to distinguish sharp frames from blurry ones, so that the nearest sharp features can be extracted to help restore the blurry frames.
3. **Event fusion module**: An event fusion module that extends the method to event-driven video deblurring, bridging the gap between conventional video deblurring and event-driven video deblurring.

### Method overview

1. **Blur-aware detector**:
   - Uses a bidirectional LSTM (BiLSTM) as the classifier, with ResNet-152 as the feature extractor.
   - Is trained jointly with a binary cross-entropy loss and a supervised contrastive loss to improve generalization to real-world blurry videos.
2. **Hybrid Transformer**:
   - **Window-based local Transformer**: Uses cross-attention shifted-window Transformer (CSWT) blocks at the third scale to aggregate features of neighboring frames without explicit spatial alignment.
   - **Global Transformer**: Implemented in a multi-scale scheme, it aggregates the nearest sharp features from the detected sharp frames through global attention.
3. **Event fusion module**:
   - Incorporated into the global Transformer, it extends the method to event-driven video deblurring and improves performance without significantly increasing computational overhead.

### Experimental results

- Extensive experiments on multiple benchmark datasets, including synthetic datasets (such as GOPRO and REDS) and real-world datasets (such as BSD), verify the effectiveness and generalization ability of the method.
- For event-driven video deblurring, performance was evaluated on the CED and RBE datasets; the results show that the method outperforms existing methods in both quantitative metrics and visual quality.

### Summary

This paper proposes a new video deblurring method that uses hybrid Transformers to aggregate features from neighboring frames and detected sharp frames, improving deblurring quality. In addition, the event fusion module extends the method to event-driven video deblurring, further improving performance.
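Both branches of the hybrid Transformer rest on cross attention. Queries come from the blurry target frame, while keys and values come from a reference frame: a neighboring frame in the local branch, or a detected sharp frame in the global branch. The following is a minimal single-head sketch in plain Python. It assumes tokens are already-projected feature vectors and deliberately omits the learned Q/K/V projections, the shifted windows, and the multi-scale matching described above. The names `cross_attention` and `softmax` are illustrative, not taken from the authors' code.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """Single-head scaled dot-product cross attention.

    queries: tokens from the blurry target frame (e.g. one window).
    keys/values: tokens from a reference frame (neighbor or sharp frame).
    Returns one aggregated feature vector per query token.
    """
    if not keys or len(keys) != len(values):
        raise ValueError("keys and values must be non-empty and equal in length")
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim_v = len(values[0])
    out: List[Vector] = []
    for q in queries:
        # Scaled similarity of this query token to every reference token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of reference values. Every query attends to every
        # reference token in the set, so no explicit spatial alignment
        # (e.g. optical-flow warping) is applied beforehand.
        agg = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
        out.append(agg)
    return out


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[10.0, 0.0], [0.0, 10.0]]
    # The query is more similar to the first key, so the output
    # leans toward the first value vector.
    print(cross_attention(q, k, v))
```

In this framing, the local branch would restrict `keys`/`values` to tokens of a neighboring frame inside the same window, and the global branch would draw them from the nearest detected sharp frame. That mapping follows the description above, but the exact windowing and matching schemes are not reproduced here.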