Abstract: Current pedestrian attribute recognition (PAR) algorithms are developed based on multi-label or multi-task learning frameworks, which aim to discriminate the attributes using specific classification heads. However, these discriminative models are easily influenced by imbalanced data or noisy samples. Inspired by the success of generative models, we rethink the pedestrian attribute recognition scheme and believe the generative models may perform better on modeling dependencies and complexity between human attributes. In this paper, we propose a novel sequence generation paradigm for pedestrian attribute recognition, termed SequencePAR. It extracts the pedestrian features using a pre-trained CLIP model and embeds the attribute set into query tokens under the guidance of text prompts. Then, a Transformer decoder is proposed to generate the human attributes by incorporating the visual features and attribute query tokens. The masked multi-head attention layer is introduced into the decoder module to prevent the model from remembering the next attribute while making attribute predictions during training. Extensive experiments on multiple widely used pedestrian attribute recognition datasets fully validated the effectiveness of our proposed SequencePAR. The source code and pre-trained models will be released at https://github.com/Event-AHU/OpenPAR.
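
The masking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python rather than a deep learning framework, the helper names `causal_mask` and `masked_softmax` are hypothetical, and the uniform toy scores are an assumption chosen only to make the effect visible. It shows how a causal mask gives zero attention weight to later positions, so the prediction at step i can depend only on steps up to i.

```python
import math


def causal_mask(n):
    """Return an n x n boolean matrix: entry [i][j] is True when
    position i is allowed to attend to position j (that is, j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax over `scores` that assigns zero weight
    to every entry the mask disallows."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Disallowed scores become -inf, so they contribute exp(-inf) = 0.
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        peak = max(masked)  # the diagonal is always allowed, so this is finite
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy example (assumed values): uniform raw scores over 4 attribute tokens.
scores = [[0.0] * 4 for _ in range(4)]
attn = masked_softmax(scores, causal_mask(4))

for row in attn:
    print([round(w, 3) for w in row])
```

Each printed row places weight only on the current and earlier positions, which is the property the masked multi-head attention layer relies on during training.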

What problem does this paper attempt to address?

This paper targets three key problems in pedestrian attribute recognition (PAR):

1. **Sensitivity to imbalanced data and noisy samples**: Existing PAR algorithms built on multi-label or multi-task learning frameworks are vulnerable to imbalanced data and noisy samples. For example, a large number of negative samples can lead to sparse attribute predictions, and inaccurately labeled data introduces noise that degrades model performance.
2. **Weak semantic connection between attributes**: Existing methods usually predict all attributes simultaneously, which leaves the semantic connections between attributes weak and fails to capture their complex dependencies.
3. **Limitations of traditional discriminative models**: Discriminative models separate attributes through dedicated classification heads, so they struggle to model the complexity of, and the dependencies between, attributes.

### Proposed solutions

To address these problems, the authors propose a pedestrian attribute recognition framework based on sequence generation, called **SequencePAR**. Its main innovations are:

1. **Recasting attribute recognition as a sequence generation problem**: Treating attribute recognition much like image captioning makes the relationships between attributes easier to model. Specifically, SequencePAR uses a pre-trained CLIP model to extract pedestrian image features and embeds attribute descriptions into query tokens. A Transformer decoder then generates the attribute sequence.
2. **Introducing a masked multi-head attention mechanism**: During training, a masked multi-head attention layer prevents the model from seeing the next attribute, so the current attribute prediction depends only on the preceding context. This is intended to improve the model's generalization and robustness.
3. **Fusing visual and text features**: SequencePAR uses both the visual features of pedestrian images and the text representations of attributes, which lets it capture the semantic relationships between attributes more fully.

### Experimental verification

The authors ran extensive experiments on several widely used pedestrian attribute recognition datasets to verify the effectiveness of SequencePAR. The reported results show SequencePAR outperforming existing state-of-the-art methods on multiple metrics, particularly when the data are imbalanced or the labels are noisy.

### Summary

This paper re-examines pedestrian attribute recognition from the perspective of generative models and proposes the SequencePAR framework. The framework models the complex dependencies between attributes more directly, and it is reported to improve robustness and generalization and to mitigate the data imbalance and noisy sample problems that affect existing methods.
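
The reformulation from multi-label classification to sequence generation can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the attribute vocabulary `ATTRIBUTES`, the `<sos>` and `<eos>` markers, and all function names are hypothetical. It shows how a binary label vector becomes a target sequence, and how the decoder input and target are offset by one step so that each prediction is conditioned only on earlier attributes.

```python
# Hypothetical attribute vocabulary, used only for illustration.
ATTRIBUTES = ["female", "long hair", "backpack", "hat", "jeans"]
# Hypothetical start and end markers for the generated sequence.
SOS, EOS = "<sos>", "<eos>"


def labels_to_sequence(binary_labels):
    """Keep only the attributes marked present, in vocabulary order."""
    return [name for name, flag in zip(ATTRIBUTES, binary_labels) if flag]


def teacher_forcing_pair(sequence):
    """Return (decoder_input, target); the target is the decoder
    input shifted left by one step, ending with the end marker."""
    decoder_input = [SOS] + sequence
    target = sequence + [EOS]
    return decoder_input, target


def sequence_to_labels(sequence):
    """Map a generated attribute sequence back to a binary label vector."""
    present = set(sequence)
    return [1 if name in present else 0 for name in ATTRIBUTES]


labels = [1, 0, 1, 0, 1]  # assumed annotation for one pedestrian image
seq = labels_to_sequence(labels)
dec_in, tgt = teacher_forcing_pair(seq)

print(dec_in)
print(tgt)
print(sequence_to_labels(seq) == labels)
```

Only attributes that are present appear in the sequence, so absent attributes do not need an explicit negative prediction. Mapping the generated sequence back to a label vector keeps the output comparable with standard multi-label evaluation.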

SequencePAR: Understanding Pedestrian Attributes via A Sequence Generation Paradigm
