Abstract: Convolutional neural networks commonly deployed for semantic understanding of visual inspection data can generally learn robust spatial features. However, they cannot capture the temporal dependencies that characterize video data collected by robotic inspection systems. As a result, they struggle with cross-view illumination variation, perspective differences, scale changes, background clutter, and occlusion. Their performance deteriorates further under motion blur and other distortions induced by rapid camera movement. This study addresses these challenges by extending visual scene understanding from the still-image domain to the video domain through cross-frame information fusion. A deep end-to-end network is developed by integrating an encoder–decoder-based convolutional neural network with a long short-term memory-based recurrent neural network for pixel-level semantic labeling of sequential visual inspection data. The proposed multishot architecture jointly learns discriminative fusion features, leading to a rich understanding of complex spatiotemporal dynamics. The approach is validated with two case studies involving automatic structural element segmentation in robotic building and bridge inspection videos. Two multishot fusion techniques are proposed, based on sequence-to-one and sequence-to-sequence architectures. Additionally, two fusion schemes, based on the sum-of-scores and Bayesian updating rules, are examined for aggregating the multiple label maps produced at each time step by an overlapping sliding window-based inference scheme. A comprehensive performance evaluation indicated that multishot fusion could enhance the intersection over union (IoU) score by 4.6% and 13.3% for the building and bridge component segmentation tasks, respectively, compared to a baseline single-shot approach.
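The two label-map aggregation rules named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the uniform prior, the independence assumption between overlapping windows, and the toy probabilities are all assumptions made for illustration. Each frame is covered by several overlapping windows, so each pixel receives several class-probability vectors that must be fused into one label.

```python
import math

def sum_of_scores(predictions):
    """Fuse one pixel's class-score vectors by summing them per class."""
    n_classes = len(predictions[0])
    return [sum(p[c] for p in predictions) for c in range(n_classes)]

def bayesian_update(predictions, eps=1e-9):
    """Fuse one pixel's class probabilities by recursive Bayesian updating.

    Assumes a uniform prior and treats the window predictions as
    independent (an illustrative simplification). Works in log space
    to avoid underflow when many windows overlap.
    """
    n_classes = len(predictions[0])
    log_post = [0.0] * n_classes  # uniform prior, constant term dropped
    for p in predictions:
        log_post = [lp + math.log(max(pc, eps)) for lp, pc in zip(log_post, p)]
    peak = max(log_post)
    unnorm = [math.exp(lp - peak) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)

def fuse_label_map(window_maps, rule):
    """Fuse several label maps of the same frame into one map of class ids.

    window_maps: list of maps; each map is a list of pixels; each pixel
    is a class-probability vector. rule: one of the fusion functions above.
    """
    n_pixels = len(window_maps[0])
    return [argmax(rule([m[i] for m in window_maps])) for i in range(n_pixels)]

# Toy data: three overlapping windows, two pixels, two classes.
window_maps = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.9, 0.1], [0.3, 0.7]],
    [[0.001, 0.999], [0.4, 0.6]],
]
labels_sum = fuse_label_map(window_maps, sum_of_scores)
labels_bayes = fuse_label_map(window_maps, bayesian_update)
```

In this toy data the two rules disagree on the first pixel: summation follows the two moderately confident votes for class 0, while the multiplicative Bayesian update is dominated by the single near-certain vote for class 1. This shows how the choice of fusion scheme can change the final label map.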

STF: Spatio-Temporal Fusion Module for Improving Video Object Detection

Multi-Frame Image Fusion Method Combining Spatial-Temporal Saliency Detection and NSCT

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

FFAVOD: Feature fusion architecture for video object detection

Multi-spectral Image Fusion for Moving Object Detection

Spatio-temporal Interactive Fusion Based Visual Object Tracking Method

Joint Spatial and Temporal Feature Enhancement Network for Disturbed Object Detection

Multi-Stage Spatio-Temporal Fusion Network for Fast and Accurate Video Bit-Depth Enhancement

Spatio-temporal Fusion with Motion Masks for the Moving Small Target Detection from Remote-Sensing Videos

Video object detection via space–time feature aggregation and result reuse

LSTFE-Net: Long Short-Term Feature Enhancement Network for Video Small Object Detection

Spatio-Temporal Fusion Networks for Action Recognition

Deep spatiotemporal fusion network for vision-based robotic inspection of structures

STDF: Spatio-Temporal Deformable Fusion for Video Quality Enhancement on Embedded Platforms

ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection

Multi-Frame Compressed Video Quality Enhancement by Spatio-Temporal Information Balance

Deep Fusion Module for Video Action Recognition

StfNet: A Two-Stream Convolutional Neural Network for Spatiotemporal Image Fusion

Spatial-temporal Fusion Network for Fast Video Shadow Detection

Multi-Scale Spatiotemporal Feature Fusion Network for Video Saliency Prediction

Temporal-adaptive sparse feature aggregation for video object detection