Abstract: The ability to correctly classify and retrieve apparel images has a variety of applications important to e-commerce, online advertising, and internet search. In this work, we propose a robust framework for fine-grained apparel classification and for in-shop and cross-domain retrieval that eliminates the need for rich annotations such as bounding boxes, human joints, or clothing landmarks, and for training bounding-box or key-landmark detectors on them. Factors such as subtle appearance differences, variations in human pose, different shooting angles, apparel deformations, and self-occlusion add to the challenges of classifying and retrieving apparel items. Cross-domain retrieval is even harder because of the large variation between online shopping images, usually taken with ideal lighting, pose, positive angle, and a clean background, and street photos captured by users in complicated conditions with poor lighting and cluttered scenes. Our framework uses a compact bilinear CNN with the tensor sketch algorithm to generate embeddings that capture local pairwise feature interactions in a translationally invariant manner. For apparel classification, we pass the feature embeddings through a softmax classifier, while the in-shop and cross-domain retrieval pipelines use a triplet-loss-based optimization approach, such that the squared Euclidean distance between embeddings measures the dissimilarity between images. Unlike previous works that relied on bounding-box, key clothing landmark, or human joint detectors to assist the final deep classifier, the proposed framework can be trained directly on the provided category labels or on generated triplets for triplet loss optimization. Lastly, experimental results on the DeepFashion fine-grained categorization, in-shop retrieval, and consumer-to-shop retrieval datasets provide a comparative analysis with previous work in the domain.
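
The two ingredients named in the abstract, tensor-sketch compact bilinear pooling and a triplet loss on squared Euclidean distance, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sketch dimension, the random hash construction, and the margin of 0.2 are all assumptions chosen for illustration, and the circular convolution is written as a direct double loop in plain Python, whereas practical implementations typically compute it with an FFT.

```python
import random

def make_hash(n, d, seed):
    """Draw a random bucket in [0, d) and a random sign for each of n inputs (illustrative)."""
    rng = random.Random(seed)
    h = [rng.randrange(d) for _ in range(n)]
    s = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return h, s

def count_sketch(x, h, s, d):
    """Project x into d dimensions: out[h[i]] += s[i] * x[i]."""
    out = [0.0] * d
    for xi, hi, si in zip(x, h, s):
        out[hi] += si * xi
    return out

def circular_convolution(a, b):
    """Direct O(d^2) circular convolution; an FFT yields the same result faster."""
    d = len(a)
    out = [0.0] * d
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % d] += ai * bj
    return out

def tensor_sketch(x, hashes, d):
    """Sketch the outer product of x with itself without ever forming it."""
    (h1, s1), (h2, s2) = hashes
    return circular_convolution(count_sketch(x, h1, s1, d),
                                count_sketch(x, h2, s2, d))

def compact_bilinear(local_features, hashes, d):
    """Sum-pool tensor sketches over spatial locations, so location order does not matter."""
    pooled = [0.0] * d
    for x in local_features:
        for k, v in enumerate(tensor_sketch(x, hashes, d)):
            pooled[k] += v
    return pooled

def squared_euclidean(u, v):
    """Squared Euclidean distance, used here as the dissimilarity between embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(a, p) - d(a, n) + margin; the margin value is an assumed placeholder."""
    gap = squared_euclidean(anchor, positive) - squared_euclidean(anchor, negative)
    return max(0.0, gap + margin)
```

Because the pooled descriptor is a sum over locations, permuting the local features leaves it unchanged, which is one way to read the "translationally invariant" property mentioned above. The loss is zero once the negative is farther from the anchor than the positive by at least the margin.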

Where to Look and How to Describe: Fashion Image Retrieval With an Attentional Heterogeneous Bilinear Network

Fashion Recommendation on Street Images

Interpretable Multimodal Retrieval for Fashion Products

MMFL-Net: Multi-scale and Multi-granularity Feature Learning for Cross-domain Fashion Retrieval

Personalized Fashion Recommendation with Visual Explanations Based on Multimodal Attention Network

Fine-grained Apparel Classification and Retrieval without rich annotations

Cross-Domain Image Retrieval with Attention Modeling

Describe Fashion Products via Local Sparse Self-Attention Mechanism and Attribute-based Re-sampling Strategy

Searching for Apparel Products from Images in the Wild

AE-Net: Fine-grained Sketch-Based Image Retrieval Via Attention-Enhanced Network

FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval

Fashion Focus: Multi-modal Retrieval System for Video Commodity Localization in E-commerce

Search by Image: Beauty Product Retrieval Network via Salient Attention

Search By Image: Deeply Exploring Beneficial Features for Beauty Product Retrieval

Hierarchical Cross-Attention Network for Virtual Try-On

Fashion Analysis With A Subordinate Attribute Classification Network

Attentive Fashion Grammar Network For Fashion Landmark Detection And Clothing Category Classification

Texture and Shape Biased Two-Stream Networks for Clothing Classification and Attribute Recognition

Hierarchical Similarity Learning for Language-Based Product Image Retrieval

Deep Fashion Analysis with Feature Map Upsampling and Landmark-Driven Attention

FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval