Abstract: The non-local self-attention mechanism can significantly improve feature representation by capturing long-range dependencies, but at the cost of high computational complexity. To address this issue, the self-attention-based autoregressive axial transformer was proposed; it applies attention along a single axis of the feature maps rather than over the whole maps with their large receptive fields. For image data, it performs axial-attention twice in the spatial dimension of the feature maps, once along the height axis and once along the width axis. However, there is still room for improvement: the 2D spatial feature map can be converted into a 1D feature sequence so that axial-attention is performed only once along it, saving further computing resources. Motivated by this insight, we propose an Efficient Axial-Attention Network (EAAN) for video-based person re-identification (Re-ID) that reduces computation and improves accuracy by serializing feature maps at multiple granularities and reducing the number of axial-attention runs. We also introduce a deserialization approach that restores the shape of the feature maps. Moreover, we extend the Channel Transformer Network (CTN) to a wider range of uses. Additionally, we verify that the serialized feature sequence enhances expressiveness in our EAAN at lower complexity. Experiments on the MARS and DukeMTMC-VideoReID (DukeV) datasets show strong performance in both computational efficiency and accuracy. EAAN not only outperforms the state-of-the-art method on MARS by 0.3% in both Rank-1 and mAP and surpasses that on DukeV by 0.1% in Rank-1 with equal mAP, but also reduces parameters and GFLOPs (giga floating-point operations) by 16.9% and 6.6%, respectively, compared to another axial-attention-based method.
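
To make the complexity argument concrete, the following minimal sketch counts the pairwise attention scores required for a single H×W feature map under three schemes: full non-local attention, two-pass axial-attention (height, then width), and a single attention pass over a serialized 1D sequence. The function name `attention_pair_counts` and the sequence length `seq_len` are illustrative assumptions; the abstract does not state how long the serialized sequence is, so the last figure only shows how the cost scales with that length and is not EAAN's actual cost.

```python
def attention_pair_counts(h, w, seq_len):
    """Count query-key score entries for one h x w feature map.

    seq_len is a hypothetical length for the serialized 1D sequence;
    it is an assumption for illustration, not a value from the paper.
    """
    # Non-local attention: every position attends to every position.
    full = (h * w) ** 2
    # Axial-attention, height pass: w columns, each with h * h scores.
    # Axial-attention, width pass:  h rows, each with w * w scores.
    axial_two_pass = w * h * h + h * w * w
    # One attention pass along a serialized sequence of length seq_len.
    single_pass = seq_len ** 2
    return {
        "full": full,
        "axial_two_pass": axial_two_pass,
        "single_pass": single_pass,
    }


counts = attention_pair_counts(h=16, w=8, seq_len=16)
print(counts)
```

For a 16×8 map, this gives 16384 score entries for full attention and 3072 for two-pass axial-attention; with the assumed sequence length of 16, the single pass needs 256. The saving of a single pass therefore depends entirely on how short the serialized sequence is relative to H and W.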

An Efficient Axial-Attention Network for Video-Based Person Re-Identification