Abstract: Intelligent video understanding can be defined as the integration of video technology and analytics for purposes such as tracking movements or events. Tasks involving video processing, perception, and understanding are receiving increasing attention within computer vision, pattern recognition, and machine learning. The advent of deep learning models demonstrates the significance of both low-level and high-level video interpretation in real-world applications such as super-pixel volumetric restoration, autonomous driving, human-computer interaction, robotics, and video surveillance. In contrast to images, videos provide additional sequential information, so video streams are highly valuable for compensating for the deficiencies of still images. However, understanding videos is much more challenging than understanding images because of the added space-time complexity. Video transformer networks have recently emerged as an effective alternative to convolutional networks for video tasks such as action classification and video object/instance segmentation.
Inspired by recent developments in vision transformers (ViT), video transformers operate on spatial-temporal queries across temporal steps. The temporal self-attention and spatial cross-attention they employ hold promise for many video recognition tasks. To embrace the emerging challenges in intelligent video understanding, this special issue establishes a venue that brings together brilliant ideas and advanced technological research outcomes from across the global research and industrial communities. This special issue promotes engagement in the field of advanced transformer-based algorithms for video applications. It also highlights ongoing investigations and new applications. Prospective submissions may fall into, but are not limited to, the following topics: optical flow estimation; depth estimation from video streams; video object/instance/panoptic segmentation; motion estimation; multi-object tracking; anomaly event detection; supervised, weakly supervised, or unsupervised representation learning methods for video understanding; video generation and intelligent editing; light-weight networks for long-video processing. Authors are requested to submit full research papers that comply with the general scope of the journal.
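As a rough illustration of the temporal self-attention and spatial cross-attention mentioned above, the following minimal NumPy sketch applies single-head scaled dot-product attention first along the time axis and then from a small set of queries to the patches of each frame. It is not the method of any particular model: the function names, tensor sizes, and the use of identity projections (no learned weights) are assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention (single head, no learned projections).

    q: (..., Lq, D); k and v: (..., Lk, D). Returns (output, weights).
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax((q @ np.swapaxes(k, -1, -2)) * scale, axis=-1)
    return weights @ v, weights


def temporal_self_attention(tokens):
    """Each spatial location attends to the same location in all frames.

    tokens: (T, N, D) = frames x patches per frame x channels.
    """
    per_location = np.swapaxes(tokens, 0, 1)          # (N, T, D)
    out, weights = attention(per_location, per_location, per_location)
    return np.swapaxes(out, 0, 1), weights            # (T, N, D), (N, T, T)


def spatial_cross_attention(queries, tokens):
    """A shared set of queries attends to the patches of every frame.

    queries: (Q, D); tokens: (T, N, D). Returns (T, Q, D) and (T, Q, N).
    """
    q = np.broadcast_to(queries, (tokens.shape[0],) + queries.shape)
    return attention(q, tokens, tokens)


# Hypothetical sizes: 4 frames, 6 patches per frame, 8 channels, 3 queries.
rng = np.random.default_rng(0)
T, N, D, Q = 4, 6, 8, 3
video_tokens = rng.normal(size=(T, N, D))
object_queries = rng.normal(size=(Q, D))

mixed, w_time = temporal_self_attention(video_tokens)
readout, w_space = spatial_cross_attention(object_queries, mixed)
```

Factorising attention this way keeps each attention matrix small (T x T per location and Q x N per frame) instead of forming a single joint matrix over all T x N tokens, which is one reason such designs are attractive for long videos.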

Guest Editorial Introduction to the Special Issue on Video Transformers

TransVOS: Video Object Segmentation with Transformers

Call for Papers: Special Issue on Intelligent Network Video Advances Based on Transformers

A Survey of Visual Transformers

Space or time for video classification transformers

Transformers in computational visual media: A survey

A Survey on Visual Transformer

Guest Editorial Introduction to the Special Section on Video and Language

Efficient Video Transformers with Spatial-Temporal Token Selection

Understanding Video Transformers via Universal Concept Discovery

A Survey on Vision Transformer

Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

Transformers in Vision: A Survey

Transformers Meet Visual Learning Understanding: A Comprehensive Review

Vision Transformers: State of the Art and Research Challenges

HaViT: Hybrid-Attention Based Vision Transformer for Video Classification

Spatial-Temporal Transformer based Video Compression Framework

Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling

Multi-Scale Temporal Difference Transformer for Video-Text Retrieval

VDTR: Video Deblurring with Transformer