Abstract: Most video restoration networks are slow, have high computational load, and can't be used for real-time video enhancement. In this work, we design an efficient and fast framework to perform real-time video enhancement for practical use-cases like live video calls and video streams. Our proposed method, called Recurrent Bottleneck Mixer Network (ReBotNet), employs a dual-branch framework. The first branch learns spatio-temporal features by tokenizing the input frames along the spatial and temporal dimensions using a ConvNext-based encoder and processing these abstract tokens using a bottleneck mixer. To further improve temporal consistency, the second branch employs a mixer directly on tokens extracted from individual frames. A common decoder then merges the features from the two branches to predict the enhanced frame. In addition, we propose a recurrent training approach where the last frame's prediction is leveraged to efficiently enhance the current frame while improving temporal consistency. To evaluate our method, we curate two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computations, reduced memory requirements, and faster inference time.
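The abstract does not spell out what the mixer computes. As a rough illustration only, the sketch below assumes an MLP-Mixer-style block: one step mixes information across tokens, a second step mixes across channels, each with a residual connection. The names `matmul`, `add`, and `mixer_block` are hypothetical, and each mixing step is reduced to a single linear map (no normalization or nonlinearity) to keep the example short. It is not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mixer_block(tokens: Matrix, token_weights: Matrix, channel_weights: Matrix) -> Matrix:
    """Simplified mixer block (assumption: MLP-Mixer style, linear maps only).

    tokens:          (num_tokens x channels)
    token_weights:   (num_tokens x num_tokens), mixes across the token axis
    channel_weights: (channels x channels), mixes across the channel axis
    """
    # Token mixing: every output token is a weighted sum of all input tokens.
    tokens = add(tokens, matmul(token_weights, tokens))
    # Channel mixing: every output channel is a weighted sum of input channels.
    return add(tokens, matmul(tokens, channel_weights))


# Two tokens with two channels each; the token weights swap the two tokens
# and the channel weights are the identity.
example = mixer_block(
    tokens=[[1.0, 2.0], [3.0, 4.0]],
    token_weights=[[0.0, 1.0], [1.0, 0.0]],
    channel_weights=[[1.0, 0.0], [0.0, 1.0]],
)
```

Because both steps are matrix products over a small set of tokens, the cost scales with the number of tokens, not the number of pixels, which is consistent with the abstract's emphasis on processing compact "abstract tokens" at a bottleneck.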

What problem does this paper attempt to address?

The paper addresses the poor performance of existing video enhancement networks in real-time applications. Specifically:

1. **Slow speed and high computational load**: Most existing video restoration networks are slow and computationally expensive, so they cannot be used for real-time video enhancement in settings such as live video calls and video streaming.
2. **Latency and temporal consistency**: Many methods require both past and future frames as input, which introduces latency in streaming video and hurts real-time performance.
3. **Multiple degradations**: Real-world videos usually suffer from several degradations at once (noise, blur, compression artifacts, and so on), while existing video restoration methods are often optimized for a single type of degradation.

To solve these problems, the authors propose a new efficient framework, the Recurrent Bottleneck Mixer Network (ReBotNet), which aims to achieve fast real-time video enhancement. ReBotNet addresses the problems above in the following ways:

- **Dual-branch architecture**: The first branch extracts spatio-temporal features with a ConvNext-based encoder and processes them with a bottleneck mixer; the second branch applies a mixer directly to tokens extracted from individual frames to further improve temporal consistency.
- **Recurrent training method**: The prediction for the previous frame is used as an additional input when enhancing the current frame. This improves temporal consistency and efficiency, and reduces both the number of input frames needed and the computational cost.
- **New datasets**: To better simulate real-world scenarios, the authors create two new datasets, PortraitVideo and FullVideo, which contain cropped face videos and complete low-quality videos respectively, for evaluating the model in different settings.

Through these innovations, ReBotNet significantly reduces computational requirements and speeds up inference while maintaining high-quality enhancement, which makes it suitable for real-world applications such as real-time video calls and streaming media.
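The recurrent scheme described above can be sketched as a streaming loop in which each output depends only on the current frame and the previous prediction, never on future frames. This is a minimal illustration under stated assumptions: `toy_enhance` is a hypothetical stand-in for the network (a simple blend), frames are flattened lists of pixel values, and bootstrapping the first step with the raw first frame is a choice made here, not something the source specifies.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Frame = List[float]  # toy stand-in: a frame flattened to pixel intensities


def toy_enhance(current: Frame, previous_prediction: Frame) -> Frame:
    """Hypothetical placeholder for the network: blend the two inputs equally."""
    return [0.5 * c + 0.5 * p for c, p in zip(current, previous_prediction)]


def enhance_stream(
    frames: Iterable[Frame],
    enhance: Callable[[Frame, Frame], Frame] = toy_enhance,
) -> Iterator[Frame]:
    """Yield one enhanced frame per input frame, in arrival order.

    Only the current frame and the previous prediction are used, so no
    future frame has to be buffered before an output can be produced.
    """
    previous: Optional[Frame] = None
    for frame in frames:
        if previous is None:
            # Assumption: with no earlier prediction, reuse the raw first frame.
            previous = frame
        prediction = enhance(frame, previous)
        yield prediction
        # Feed the prediction back as the extra input for the next frame.
        previous = prediction


outputs = list(enhance_stream([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]))
```

Writing the loop as a generator makes the causal structure explicit: an enhanced frame is available as soon as its input frame arrives, which is the property that matters for live calls and streams.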

ReBotNet: Fast Real-time Video Enhancement
