Abstract: The ability to correctly classify and retrieve apparel images has a variety of applications important to e-commerce, online advertising, and internet search. In this work, we propose a robust framework for fine-grained apparel classification, in-shop retrieval, and cross-domain retrieval that eliminates the need for rich annotations such as bounding boxes, human joints, or clothing landmarks, and for training a bounding-box or key-landmark detector. Factors such as subtle appearance differences, variations in human pose, different shooting angles, apparel deformations, and self-occlusion add to the challenges of classifying and retrieving apparel items. Cross-domain retrieval is even harder because of the large variation between online shopping images, usually taken with ideal lighting, pose, and a favorable angle against a clean background, and street photos captured by users in complicated conditions with poor lighting and cluttered scenes. Our framework uses a compact bilinear CNN with the Tensor Sketch algorithm to generate embeddings that capture local pairwise feature interactions in a translationally invariant manner. For apparel classification, we pass the feature embeddings through a softmax classifier, while the in-shop and cross-domain retrieval pipelines use a triplet-loss-based optimization approach, such that the squared Euclidean distance between embeddings measures the dissimilarity between images. Unlike previous works that relied on bounding-box, key clothing landmark, or human-joint detectors to assist the final deep classifier, the proposed framework can be trained directly on the provided category labels or on generated triplets for triplet-loss optimization. Lastly, experimental results on the DeepFashion fine-grained categorization, in-shop retrieval, and consumer-to-shop retrieval datasets provide a comparative analysis with previous work in the domain.
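The two building blocks named in the abstract, compact bilinear pooling via Tensor Sketch and a triplet loss over squared Euclidean distances, can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the feature and sketch dimensions, and the margin of 0.2 are all assumptions chosen for the example, and a real pipeline would take the local features from a CNN feature map rather than from random numbers.

```python
import numpy as np

def count_sketch_matrix(d, D, rng):
    """Random Count Sketch projection from d to D dimensions (hypothetical helper).

    Each input index i is hashed to one output bucket h[i] with a random sign s[i].
    """
    h = rng.integers(0, D, size=d)
    s = rng.choice([-1.0, 1.0], size=d)
    P = np.zeros((d, D))
    P[np.arange(d), h] = s
    return P

def compact_bilinear(features, P1, P2):
    """Tensor Sketch approximation of bilinear pooling.

    features: array of shape (locations, d), one local descriptor per row.
    The sketch of each outer product x (x) x is the circular convolution of two
    independent count sketches of x, computed here with FFTs. Summing over
    locations makes the result independent of where each descriptor occurs.
    """
    D = P1.shape[1]
    f1 = np.fft.rfft(features @ P1, axis=1)
    f2 = np.fft.rfft(features @ P2, axis=1)
    per_location = np.fft.irfft(f1 * f2, n=D, axis=1)
    return per_location.sum(axis=0)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on squared Euclidean distances.

    The margin value is an assumption made for this sketch.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, D = 8, 16                      # assumed feature and sketch sizes
    P1 = count_sketch_matrix(d, D, rng)
    P2 = count_sketch_matrix(d, D, rng)

    # Random stand-ins for CNN feature maps of three images (5 locations each).
    anchor = compact_bilinear(rng.standard_normal((5, d)), P1, P2)
    positive = compact_bilinear(rng.standard_normal((5, d)), P1, P2)
    negative = compact_bilinear(rng.standard_normal((5, d)), P1, P2)
    print(triplet_loss(anchor, positive, negative))
```

Because the pooled embedding is a sum over spatial locations, reordering the rows of `features` leaves it unchanged, which is one way to read the "translationally invariant" property mentioned above. For classification, the same embedding would instead feed a softmax layer.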

Inferring the Importance of Product Appearance with Semi-supervised Multi-modal Enhancement: A Step Towards the Screenless Retailing

Inferring the Importance of Product Appearance: A Step Towards the Screenless Revolution

Better Than Humans: a Method for Inferring Consumer Shopping Intentions by Reading Facial Expressions

Matryoshka Peek: Toward Learning Fine-Grained, Robust, Discriminative Features for Product Search

Interpretable Multimodal Retrieval for Fashion Products

Design of Smart Unstaffed Retail Shop Based on IoT and Artificial Intelligence

VSEM-SAMMI: An Explainable Multimodal Learning Approach to Predict User-Generated Image Helpfulness and Product Sales

Machine learning based approach for exploring online shopping behavior and preferences with eye tracking

When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing Features

Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining

Smart customer service in unmanned retail store enhanced by large language model

Image Score: Learning and Evaluating Human Preferences for Mercari Search

SEMI: A Sequential Multi-Modal Information Transfer Network for E-Commerce Micro-Video Recommendations

Unified Vision-Language Representation Modeling for E-Commerce Same-Style Products Retrieval

MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding

When relevance is not Enough: Promoting Visual Attractiveness for Fashion E-commerce

Complete the Look: Scene-based Complementary Product Recommendation

Attending to Customer Attention: A Novel Deep Learning Method for Leveraging Multimodal Online Reviews to Enhance Sales Prediction

The Impact of Product Photo on Online Consumer Purchase Intention: an Image-Processing Enabled Empirical Study

Fine-grained Apparel Classification and Retrieval without rich annotations

Multimodal Joint Attribute Prediction and Value Extraction for E-commerce Product