Abstract: In this paper, we propose a new setting for generating product descriptions from images, augmented by marketing keywords. It leverages the combined power of visual and textual information to create descriptions that are more tailored to the unique features of products. For this setting, previous methods use visual and textual encoders to encode the image and keywords, and employ a language model-based decoder to generate the product description. However, the generated description is often inaccurate and generic, because products in the same category have similar copywriting and optimizing the overall framework on large-scale samples makes models concentrate on common words while ignoring product features. To alleviate this issue, we present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces a similar product sample as a reference and uses the in-context learning capability of language models to produce the description. During training, we keep the visual encoder and language model frozen and optimize only the modules responsible for creating multimodal in-context references and dynamic prompts. This approach preserves the language generation ability of large language models (LLMs) and substantially increases description diversity. To assess the effectiveness of ModICT across various language model scales and types, we collect data from three distinct product categories within the e-commerce domain. Extensive experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on ROUGE-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods. Our findings underscore the potential of ModICT as a valuable tool for enhancing automatic generation of product descriptions in a wide range of applications.
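The diversity figure above is reported on D-5. The abstract does not define the metric; assuming it denotes the commonly used Distinct-5 score (the ratio of unique 5-grams to all 5-grams in the generated text), a minimal sketch of a corpus-level Distinct-n computation could look like the following. The function name `distinct_n`, the pre-tokenized input, and the corpus-level aggregation are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from typing import Iterable, List


def distinct_n(texts: Iterable[List[str]], n: int = 5) -> float:
    """Corpus-level Distinct-n: unique n-grams divided by total n-grams.

    Assumption: each text is already tokenized into a list of tokens.
    The tokenization used in the paper is not specified in the abstract.
    """
    counts: Counter = Counter()
    for tokens in texts:
        # Slide a window of length n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    # Texts shorter than n tokens contribute no n-grams.
    return len(counts) / total if total else 0.0


if __name__ == "__main__":
    generated = [
        ["soft", "cotton", "shirt", "for", "summer", "days"],
        ["soft", "cotton", "shirt", "for", "summer", "trips"],
    ]
    print(distinct_n(generated, n=5))
```

A higher score means fewer repeated n-grams across the generated descriptions, which is why generic, template-like copywriting tends to score low on this kind of metric.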

Product promotion copywriting from multimodal data: New benchmark and model

Attract me to Buy: Advertisement Copywriting Generation with Multimodal Multi-structured Information

Poet: Product-oriented Video Captioner for E-commerce

VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding

A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

Generating Attractive and Authentic Copywriting from Customer Reviews

Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding

Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Look, Read and Feel: Benchmarking Ads Understanding with Multimodal Multitask Learning

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

Improving Multimodal Datasets with Image Captioning

Multimodality-guided Visual-Caption Semantic Enhancement

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Automatic Product Copywriting for E-Commerce

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

VideoMCC: a New Benchmark for Video Comprehension

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Chinese Title Generation for Short Videos: Dataset, Metric and Algorithm

A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation
