Abstract: In unstructured environments, robots are likely to encounter desktop objects that they have never seen before. Classifying these objects precisely is a prerequisite for object-specific manipulation tasks, yet collecting large-scale object classification datasets is time-consuming. Inspired by prompt tuning methods, we propose the PRO-CLIP network, a category measurement method for desktop objects. Specifically, PRO-CLIP performs few-shot classification based on the knowledge of a pretrained vision-language model (VLM). It uses token-level and prompt-level optimal transportation (OT) to jointly fine-tune the learnable vision-language prompts. For the token-level stage, we propose the image patch reweighting (PR) mechanism, which focuses the alignment on the image patches that are close to the patch prototypes. This lets the patch embeddings converge toward shared category representations, which reduces the intraclass differences of the visual features. For the prompt-level stage, we propose a cascading OT (COT) module that simultaneously considers the task-irrelevant knowledge in zero-shot features and the task-specific knowledge in few-shot features. Because task-irrelevant knowledge generalizes well, the proposed module regularizes the features during OT. Finally, we propose the UP loss to supervise the whole network. It consists of unbalanced logit-level consistency losses and a visual prototype loss. The logit-level consistency losses keep the learnable features close to the zero-shot features. The prototype loss pulls the visual features toward their corresponding prototypes. We demonstrate the effectiveness of our method through few-shot classification experiments on several datasets, including desktop objects. The relevant code will be available at https://github.com/NeuCV-IRMI/proclip.
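As a rough illustration of the token-level idea described in the abstract, the sketch below aligns image patch embeddings with prompt embeddings through entropic optimal transport, giving more transport mass to patches that lie close to a prototype. It is not the PRO-CLIP implementation. The names `patch_weights` and `sinkhorn`, the softmax form of the reweighting, the use of the mean patch as a stand-in prototype, the cosine cost, and all hyperparameters are assumptions made for this example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def patch_weights(patches, prototype, temperature=0.1):
    """Hypothetical reweighting: softmax over cosine similarity to a prototype."""
    sims = l2_normalize(patches) @ l2_normalize(prototype)
    logits = sims / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic OT plan with row marginal a and column marginal b."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Random stand-ins for real features (assumption: 49 patches, 4 prompts, dim 64).
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 64))
prompts = rng.normal(size=(4, 64))
prototype = patches.mean(axis=0)  # stand-in; the paper's prototypes may be defined differently

a = patch_weights(patches, prototype)                   # reweighted patch marginal
b = np.full(prompts.shape[0], 1.0 / prompts.shape[0])   # uniform prompt marginal
cost = 1.0 - l2_normalize(patches) @ l2_normalize(prompts).T  # cosine distance
plan = sinkhorn(cost, a, b)
ot_distance = float((plan * cost).sum())
```

In this sketch the non-uniform marginal `a` is the only place where the reweighting enters: patches near the prototype carry more mass, so they dominate the resulting transport cost `ot_distance`.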

CLIPose: Category-Level Object Pose Estimation with Pre-trained Vision-Language Knowledge

CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting

GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence

DONet: Learning Category-Level 6D Object Pose and Size Estimation from Depth Observation

CLIP^2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data

PointCLIP: Point Cloud Understanding by CLIP

CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition

ProposalCLIP: Unsupervised Open-Category Object Proposal Generation Via Exploiting CLIP Cues

Synthetic Depth Image-based Category-Level Object Pose Estimation with Effective Pose Decoupling and Shape Optimization

Category-level Pose Estimation and Iterative Refinement for Monocular RGB-D Image

CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation

Instance-Adaptive and Geometric-Aware Keypoint Learning for Category-Level 6D Object Pose Estimation

Human Pose Descriptions and Subject-Focused Attention for Improved Zero-Shot Transfer in Human-Centric Classification Tasks

CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP

PRO-CLIP: A CLIP-Based Category Measurement Network Through Prototype and Regularized Optimal Transportation

HS-Pose: Hybrid Scope Feature Extraction for Category-level Object Pose Estimation