ITCD: Image to Text Translation for Classification by Diffusion Models

Yanxiang Ma, Yi Xu
DOI: https://doi.org/10.1145/3688864.3689148
2024-01-01
Abstract: Traditional classification models represent categories as one-hot vectors, whereas modern image-text multi-modal models use textual representations for greater flexibility and better performance. These models align images and text by projecting images into the text space, which benefits classification and downstream tasks. We focus on using a generative model, specifically a diffusion model, for this projection. A diffusion model gradually transforms noise into samples from a target distribution and can therefore serve as a learnable projection process. We apply this approach to cross-modal projection for image classification and related tasks. Pre-trained encoders provide a latent representation of the image and a set of text representations for the candidate classes. The diffusion model then projects the image features onto the corresponding text feature distribution, and the closest text feature determines the image's class. Furthermore, this paper validates the advantages of diffusion-based classifiers in out-of-domain (OOD) data detection, where our model achieves state-of-the-art (SOTA) performance without additional training.
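The final step described in the abstract, assigning the class whose text feature is closest to the projected image feature, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity as the distance, and the toy 3-dimensional features are all assumptions made for illustration, and the learned diffusion projection is replaced by a pluggable `project` callable that defaults to the identity.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def classify_by_nearest_text(image_feats, text_feats, project=lambda z: z):
    """Assign each image the class of its most similar text feature.

    image_feats: (n_images, d) array from a pre-trained image encoder (assumed).
    text_feats:  (n_classes, d) array from a pre-trained text encoder (assumed).
    project:     hypothetical stand-in for the diffusion-based projection;
                 the identity is used here purely as a placeholder.
    """
    projected = project(image_feats)
    # Cosine similarity between every projected image and every class text.
    sims = l2_normalize(projected) @ l2_normalize(text_feats).T
    return sims.argmax(axis=1), sims


# Toy features, invented for illustration only.
text_feats = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
image_feats = np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.2, 0.8]])

labels, sims = classify_by_nearest_text(image_feats, text_feats)
print(labels)
```

In a real pipeline the `project` argument would wrap the trained diffusion sampler, and the distance measure would follow whatever the paper specifies; neither detail is given in the abstract.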
What problem does this paper attempt to address?