Abstract: Recently, CNN-based post-processing has shown great potential in Synthesized View Quality Enhancement (SVQE). However, due to the limited receptive field of convolution, it is ineffective at explicitly modeling long-range dependencies, which are critical for eliminating the distortion induced by Depth Image Based Rendering (DIBR) in synthesized views. Although transformers are highly successful at learning global contextual information, they are weak at extracting local texture information. To take full advantage of both the CNN and the transformer, we present a novel U-shaped hybrid transformer with asymmetric flow division, termed AFD-former, which collaboratively captures global and local information for SVQE. Specifically, the AFD-former uses the Transformer-CNN Block (TCB) as its encoder and decoder, in which several Dynamic Hybrid Attention Blocks (DHABs) are designed to simultaneously model long-range interactions and retain texture details. Then, considering that the deeper layers of the U-shaped network contribute more to capturing global information while the shallow layers contribute more to extracting local information, an Asymmetric Flow Division Unit (AFDU) is embedded into each DHAB to assign different contributions of global and local contextual information to the transformer and CNN branches across different layers. Finally, a dynamic learnable modulator is incorporated into the two branches to help the model learn feature representations effectively. This can be viewed as a dynamic process that adjusts the weight of each channel of the input feature based on contextual cues. Extensive experiments demonstrate that the proposed AFD-former significantly enhances the perceptual quality of synthesized views at a speed similar to that of related state-of-the-art SVQE methods. The source code will be available at https://github.com/House-yuyu/AFD-former.
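
The two ideas in the abstract, depth-dependent flow division and channel-wise modulation, can be illustrated with a minimal pure-Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: the function names `split_channels` and `channel_gate`, the linear split schedule, and the sigmoid gate with per-channel `scale` and `bias` parameters are all hypothetical stand-ins for the AFDU and the dynamic learnable modulator.

```python
import math


def split_channels(num_channels, depth, num_levels):
    """Divide channels between a global (transformer) and a local (CNN) branch.

    Hypothetical linear schedule: the deeper the level, the larger the share
    routed to the global branch. The paper's actual AFDU rule may differ.
    """
    ratio = (depth + 1) / (num_levels + 1)
    n_global = round(num_channels * ratio)
    # Keep at least one channel in each branch.
    n_global = max(1, min(num_channels - 1, n_global))
    return n_global, num_channels - n_global


def channel_gate(feature, scale=None, bias=None):
    """Reweight each channel using a gate computed from its own mean.

    feature: list of channels, each a flat list of floats.
    scale, bias: per-channel parameters (assumed learnable in a real model).
    """
    n = len(feature)
    scale = scale if scale is not None else [1.0] * n
    bias = bias if bias is not None else [0.0] * n
    out = []
    for ch, s, b in zip(feature, scale, bias):
        context = sum(ch) / len(ch)                # global average of the channel
        weight = 1.0 / (1.0 + math.exp(-(s * context + b)))  # sigmoid gate in (0, 1)
        out.append([weight * v for v in ch])
    return out


if __name__ == "__main__":
    for depth in range(4):
        print(depth, split_channels(64, depth, num_levels=4))
    print(channel_gate([[0.0, 0.0], [2.0, 2.0]]))
```

Under this assumed schedule, shallow levels send most channels to the local branch and deep levels send most to the global branch, which mirrors the asymmetry the abstract describes without claiming to reproduce it.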

DFAformer: A Dual Filtering Auxiliary Transformer for Efficient Online Action Detection in Streaming Videos

MALT: Multi-scale Action Learning Transformer for Online Action Detection

Long Short-Term Transformer for Online Action Detection

OadTR: Online Action Detection with Transformers

Efficient Video Action Detection with Token Dropout and Context Refinement

Memory-and-Anticipation Transformer for Online Action Understanding

Real-time Online Video Detection with Temporal Smoothing Transformers

SODFormer: Streaming Object Detection with Transformer Using Events and Frames

Spatial–Temporal Context-Aware Online Action Detection and Prediction

An Effective-Efficient Approach for Dense Multi-Label Action Detection

Time‐attentive fusion network: An efficient model for online detection of action start

DFDT: An End-to-End DeepFake Detection Framework Using Vision Transformer

AFD-former: A Hybrid Transformer with Asymmetric Flow Division for Synthesized View Quality Enhancement

Colar: Effective and Efficient Online Action Detection by Consulting Exemplars

F2S-Net: learning frame-to-segment prediction for online action detection

DyFADet: Dynamic Feature Aggregation for Temporal Action Detection

Online Action Tube Detection Via Resolving The Spatio-Temporal Context Pattern

TLS-RWKV: Real-Time Online Action Detection with Temporal Label Smoothing

Efficient Video Transformers via Spatial-Temporal Token Merging for Action Recognition

An empirical study on temporal modeling for online action detection