Abstract: Weakly-Supervised Video Object Localization (WSVOL) involves localizing an object in videos using only video-level labels, also referred to as tags. State-of-the-art WSVOL methods like Temporal CAM (TCAM) rely on class activation mapping (CAM) and typically require a pre-trained CNN classifier. However, their localization accuracy is affected by their tendency to minimize the mutual information between different instances of a class and to exploit temporal information during training for downstream tasks, e.g., detection and tracking. In the absence of bounding box annotations, it is challenging to extract precise information about objects from temporal cues because the model struggles to locate objects over time. To address these issues, a novel method called transformer-based CAM for videos (TrCAM-V) is proposed for WSVOL. It consists of a DeiT backbone with two heads, one for classification and one for localization. The classification head is trained using a standard classification loss (CL), while the localization head is trained using pseudo-labels that are extracted using a pre-trained CLIP model. In these pseudo-labels, high and low activation values are considered to be foreground and background regions, respectively. Our TrCAM-V method allows training a localization network by sampling pseudo-pixels on the fly from these regions. Additionally, a conditional random field (CRF) loss is employed to align the object boundaries with the foreground map. During inference, the model can process individual frames for real-time localization applications. Extensive experiments on the challenging YouTube-Objects unconstrained video datasets show that our TrCAM-V method achieves new state-of-the-art performance in terms of classification and localization accuracy.
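The pseudo-pixel sampling step can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_pseudo_pixels`, the `IGNORE` label, and the `top_frac`/`bottom_frac` thresholds are all hypothetical choices made for illustration, and the activation map is assumed to be a plain 2D array (for example, one derived from a CLIP-based CAM). The sketch draws a few pixels at random from the highest-activation region as foreground and from the lowest-activation region as background, leaving all other pixels unsupervised.

```python
import numpy as np

IGNORE = -1  # hypothetical label for pixels that receive no supervision


def sample_pseudo_pixels(act_map, n_fg=1, n_bg=1,
                         top_frac=0.1, bottom_frac=0.1, rng=None):
    """Build a sparse pseudo-label map from a 2D activation map.

    Foreground pixels (label 1) are sampled from the top `top_frac` of
    activations, background pixels (label 0) from the bottom `bottom_frac`.
    Every other pixel is set to IGNORE. The fractions are assumed small
    enough that the two candidate pools do not overlap.
    """
    rng = np.random.default_rng() if rng is None else rng
    flat = act_map.ravel()
    order = np.argsort(flat)  # pixel indices sorted by ascending activation

    # Candidate pools: at least as many candidates as requested samples.
    k_bg = max(n_bg, int(bottom_frac * flat.size))
    k_fg = max(n_fg, int(top_frac * flat.size))
    bg_pool = order[:k_bg]
    fg_pool = order[-k_fg:]

    labels = np.full(flat.size, IGNORE, dtype=np.int64)
    labels[rng.choice(bg_pool, size=n_bg, replace=False)] = 0
    labels[rng.choice(fg_pool, size=n_fg, replace=False)] = 1
    return labels.reshape(act_map.shape)


# Toy activation map: a bright 3x3 "object" on a dark 8x8 background.
toy_map = np.zeros((8, 8))
toy_map[2:5, 2:5] = 1.0
toy_labels = sample_pseudo_pixels(toy_map, n_fg=2, n_bg=2,
                                  rng=np.random.default_rng(0))
```

Calling the function again on the same map yields a different random draw, which is the sense in which sampling happens "on the fly": in a training loop, a fresh sparse label map would be drawn at each step, and a loss that skips `IGNORE` pixels would be applied to the localization head's output.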

Category-aware Allocation Transformer for Weakly Supervised Object Localization

LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization

HiCT: Hierarchical Comprehend of Transformer for Weakly Supervised Object Localization

Task-Aware Weakly Supervised Object Localization With Transformer

Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization

Semantic-Constraint Matching for Transformer-Based Weakly Supervised Object Localization

Re-Attention Transformer for Weakly Supervised Object Localization

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

Adversarial Transformers for Weakly Supervised Object Localization

Spatial-Aware Token for Weakly Supervised Object Localization

Weakly Supervised Object Localization Using Long-Range Semantic Foreground Activation

Leveraging Transformers for Weakly Supervised Object Localization in Unconstrained Videos

Reperceive Global Vision of Transformer for Remote Sensing Images Weakly Supervised Object Localization

CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection

CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization

Multiscale Vision Transformer With Deep Clustering-Guided Refinement for Weakly Supervised Object Localization

MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation

Unifying Global-Local Representations in Salient Object Detection with Transformer

Unifying Global-Local Representations in Salient Object Detection with Transformers

EGSA: Enhanced and Global Semantic Activation for Weakly Supervised Object Localization

Dual Progressive Transformations for Weakly Supervised Semantic Segmentation