Language Model as Visual Explainer

Xingyi Yang,Xinchao Wang
2024-12-09
Abstract:In this paper, we present Language Model as Visual Explainer LVX, a systematic approach for interpreting the internal workings of vision models using a tree-structured linguistic explanation, without the need for model training. Central to our strategy is the collaboration between vision models and LLM to craft explanations. On one hand, the LLM is harnessed to delineate hierarchical visual attributes, while concurrently, a text-to-image API retrieves images that are most aligned with these textual concepts. By mapping the collected texts and images to the vision model's embedding space, we construct a hierarchy-structured visual embedding tree. This tree is dynamically pruned and grown by querying the LLM using language templates, tailoring the explanation to the model. Such a scheme allows us to seamlessly incorporate new attributes while eliminating undesired concepts based on the model's representations. When applied to testing samples, our method provides human-understandable explanations in the form of attribute-laden trees. Beyond explanation, we retrained the vision model by calibrating it on the generated concept hierarchy, allowing the model to incorporate the refined knowledge of visual attributes. To access the effectiveness of our approach, we introduce new benchmarks and conduct rigorous evaluations, demonstrating its plausibility, faithfulness, and stability.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem that this paper attempts to solve is: how to explain the internal operation mechanisms of vision models in a human - understandable way. Specifically, the author proposes a method named "Language Model as Visual Explainer (LVX)", aiming to analyze the decision - making process of vision models through tree - structured language explanations without retraining the models. ### Main problems and goals of the paper 1. **Explaining the internal operation of vision models**: - Current vision models (such as deep neural networks) show powerful performance when processing images, but their internal decision - making processes are often black - box - like and difficult for humans to understand. - Existing explanation methods (such as attribution methods, mechanistic interpretability, prototype analysis, etc.) provide some explanations, but usually require expert verification or explanation of the output, and are difficult for non - technical personnel to understand directly. 2. **Generating human - understandable explanations**: - The author hopes to generate a structured language explanation, especially a tree - structured explanation, so that non - technical personnel can also understand the decision - making process of vision models. - This explanation is not limited to highlighting certain pixels or features, but describes visual attributes and their hierarchical relationships through natural language. 3. **Combining language models and vision models**: - The author utilizes large - language models (LLM) to generate text descriptions of visual attributes and obtains images matching these descriptions through text - to - image APIs. - By mapping these texts and images into the embedding space of the vision model, a hierarchical visual embedding tree is constructed, thereby achieving the explanation of the vision model. 4. **Dynamically adjusting the explanation tree**: - The explanation tree is dynamically adjusted according to the performance of the vision model, removing infrequently visited nodes (i.e., attributes that are difficult for the model to recognize) and expanding frequently visited nodes (i.e., attributes that the model successfully recognizes). - This dynamic adjustment enables the explanation tree to better reflect the actual performance of the vision model. 5. **Evaluating and improving model performance**: - The author also proposes to use the generated explanations to calibrate the vision model, thereby improving the performance and reliability of the model. - For this purpose, they introduce new benchmark datasets and evaluation metrics to test the effectiveness, credibility, and stability of the LVX method. ### Summary The core problem of the paper is to analyze the decision - making process of vision models through tree - structured language explanations, making the explanations more intuitive and easier to understand. The LVX method proposed by the author combines the advantages of language models and vision models, generates a hierarchical tree of visual attributes, and improves the quality of explanations and the performance of the model through dynamic adjustment and calibration.