Abstract:To interpret deep models' predictions, attention-based visual cues are widely used in addressing \textit{why} deep models make such predictions. Beyond that, the current research community becomes more interested in reasoning \textit{how} deep models make predictions, where some prototype-based methods employ interpretable representations with their corresponding visual cues to reveal the black-box mechanism of deep model behaviors. However, these pioneering attempts only either learn the category-specific prototypes and deteriorate their generalizing capacities, or demonstrate several illustrative examples without a quantitative evaluation of visual-based interpretability with further limitations on their practical usages. In this paper, we revisit the concept of visual words and propose the Learnable Visual Words (LVW) to interpret the model prediction behaviors with two novel modules: semantic visual words learning and dual fidelity preservation. The semantic visual words learning relaxes the category-specific constraint, enabling the general visual words shared across different categories. Beyond employing the visual words for prediction to align visual words with the base model, our dual fidelity preservation also includes the attention guided semantic alignment that encourages the learned visual words to focus on the same conceptual regions for prediction. Experiments on six visual benchmarks demonstrate the superior effectiveness of our proposed LVW in both accuracy and model interpretation over the state-of-the-art methods. Moreover, we elaborate on various in-depth analyses to further explore the learned visual words and the generalizability of our method for unseen categories.

Integrating Visual Word Embeddings into Translation Language Model for Keyword Spotting on Historical Mongolian Document Images

Mining Coherent Topics in Documents Using Word Embeddings and Large-Scale Text Data

Towards Semantic Embedding In Visual Vocabulary

Semantic Visualization for Short Texts with Word Embeddings

Multi-modal Auto-regressive Modeling via Visual Words

Learn More Manchu Words with A New Visual-Language Framework

Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation

Visual Topic Semantic Enhanced Machine Translation for Multi-Modal Data Efficiency

Language Models as Knowledge Bases for Visual Word Sense Disambiguation

Deep learning model for Mongolian Citizens Feedback Analysis using Word Vector Embeddings

A Variational Autoencoding Approach for Inducing Cross-lingual Word Embeddings

Visual Grounding of Inter-lingual Word-Embeddings

Learnable Visual Words for Interpretable Image Recognition

Embedding Word Similarity with Neural Machine Translation

Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech

Word Shape Matters: Robust Machine Translation with Visual Embedding

A Feature-Word-topic Model for Image Annotation and Retrieval

Visual language modeling for image classification.

Multimodal Neural Machine Translation with Search Engine Based Image Retrieval

Visually-Augmented Language Modeling

LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions