Abstract: Infrared target detection has important applications in rescue and Earth observation. However, the low signal-to-clutter ratios and severe background noise of infrared imaging make dim infrared targets difficult to detect. Most algorithms extract features only from the spatial domain, and without temporal information their detection performance is unsatisfactory when the target differs little from the background. Although some methods do use temporal information, they are non-learning-based: they fail to adapt to complex, changing backgrounds and need their parameters adjusted for each input. To tackle this problem, in this article we propose the Spatio-Temporal Differential Multiscale Attention Network (STDMANet), a learning-based method for multiframe infrared small target detection. STDMANet first uses a temporal multiscale feature extractor to obtain spatiotemporal (ST) features at multiple time scales, and then passes them to a spatial multiscale feature refiner that enhances the semantics of the ST features while preserving the position information of small targets. Finally, unlike other learning-based networks that require binary masks for training, we design a mask-weighted heatmap loss that trains the network with only center point annotations. The proposed loss also balances missed detections against false alarms, trading off finding the targets against suppressing the background. Extensive quantitative experiments on public datasets show that the proposed STDMANet raises the ${F_{1}}$ score to as high as 0.9744, surpassing the state-of-the-art baseline by 0.1682. Qualitative experiments show that the proposed method stably extracts moving foreground targets from video sequences with various backgrounds, while reducing the false alarm rate more effectively than other recent baseline methods.
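
As a reminder of how the reported ${F_{1}}$ metric weighs missed detections against false alarms, the following is a minimal sketch of the standard ${F_{1}}$ computation from detection counts. The function name and the example counts are hypothetical illustrations, not values from the paper, and the abstract does not state the rule used to match detections to ground-truth targets.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    true_pos  : detections matched to a ground-truth target
    false_pos : detections with no matching target (false alarms)
    false_neg : targets with no matching detection (missed detections)
    """
    detected = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the calculation.
    print(f1_score(true_pos=8, false_pos=2, false_neg=2))
```

Because ${F_{1}}$ is a harmonic mean, a detector scores highly only when both the false alarm count and the missed detection count are low, which is the balance the abstract describes.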

Differential motion attention network for efficient action recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

ACTION-Net: Multipath Excitation for Action Recognition

Fine-gained Motion Enhancement for Action Recognition: Focusing on Action-Related Regions

Spatio-Temporal Adaptive Network with Bidirectional Temporal Difference for Action Recognition

Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition

TEINet: Towards an Efficient Architecture for Video Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition

DC3D: A Video Action Recognition Network Based on Dense Connection

CANet: Comprehensive Attention Network for video-based action recognition

Learning Comprehensive Motion Representation for Action Recognition

Spatio-temporal attention on manifold space for 3D human action recognition

EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition

TDN: Temporal Difference Networks for Efficient Action Recognition

An efficient attention module for 3d convolutional neural networks in action recognition

ADfM-Net: an Adversarial Depth-From-Motion Network Based on Cross Attention and Motion Enhanced

STDMANet: Spatio-Temporal Differential Multiscale Attention Network for Small Moving Infrared Target Detection

AE-Net: Adjoint Enhancement Network for Efficient Action Recognition in Video Understanding

Deep manifold-to-manifold transforming network for action recognition