B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable

Shreyash Arya, Sukrut Rao, Moritz Böhle, Bernt Schiele

2024-11-02

Abstract: B-cos Networks have been shown to be effective for obtaining highly human interpretable explanations of model decisions by architecturally enforcing stronger alignment between inputs and weight. B-cos variants of convolutional networks (CNNs) and vision transformers (ViTs), which primarily replace linear layers with B-cos transformations, perform competitively to their respective standard variants while also yielding explanations that are faithful by design. However, it has so far been necessary to train these models from scratch, which is increasingly infeasible in the era of large, pre-trained foundation models. In this work, inspired by the architectural similarities in standard DNNs and B-cos networks, we propose 'B-cosification', a novel approach to transform existing pre-trained models to become inherently interpretable. We perform a thorough study of design choices to perform this conversion, both for convolutional neural networks and vision transformers. We find that B-cosification can yield models that are on par with B-cos models trained from scratch in terms of interpretability, while often outperforming them in terms of classification performance at a fraction of the training cost. Subsequently, we apply B-cosification to a pretrained CLIP model, and show that, even with limited data and compute cost, we obtain a B-cosified version that is highly interpretable and competitive on zero shot performance across a variety of datasets. We release our code and pre-trained model weights at https://github.com/shrebox/B-cosification.

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper asks how existing pre-trained deep neural networks (DNNs) can be converted into inherently interpretable models while maintaining or improving their performance, and at a small fraction of the cost of training from scratch. The authors propose "B-cosification", which fine-tunes a pre-trained model so that it becomes as interpretable as a B-cos model trained from scratch.

B-cos models obtain highly human-interpretable explanations through architectural changes, chiefly by replacing linear layers with B-cos transformations. Until now, such models had to be trained from scratch, which demands substantial compute and time and is increasingly infeasible for large foundation models. The authors therefore explore how to reuse existing pre-trained weights and reach comparable interpretability through fine-tuning alone.

The main contributions of the paper are:

1. **Proposing the B-cosification technique**: A novel technique that fine-tunes existing black-box DNNs into inherently interpretable B-cos DNNs, which often outperform B-cos DNNs trained from scratch in classification performance.
2. **Detailed study of design choices**: The authors conduct a thorough study of the design choices involved in the conversion to find an effective B-cosification strategy.
3. **Application to supervised image classifiers**: The authors apply B-cosification to supervised ImageNet classifiers, covering both CNNs and ViTs. The B-cosified models are on par with B-cos models trained from scratch on interpretability metrics and often exceed them in accuracy.
4. **Extension to CLIP models**: The authors also apply B-cosification to a pre-trained CLIP model (a large vision-language model). Even with limited data and compute, the B-cosified CLIP model is highly interpretable and remains competitive in zero-shot performance across a variety of datasets.

In summary, the paper uses B-cosification to make existing pre-trained models inherently interpretable while preserving their performance, which lowers training cost and increases model transparency.
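To make "using B-cos transformations instead of linear layers" concrete, here is a minimal NumPy sketch. It assumes the formulation from the earlier B-cos network papers, which this page does not restate: the output of a unit is the linear response to a unit-norm weight vector, scaled by |cos(x, w)|^(B-1). The function name `bcos_linear` and the toy shapes are hypothetical and are not taken from the authors' released code.

```python
import numpy as np

def bcos_linear(x, W, B=2.0, eps=1e-12):
    """Sketch of a bias-free B-cos transform (assumed formulation).

    x: inputs of shape (n, d); W: weights of shape (k, d).
    Returns an array of shape (n, k).
    """
    # Normalise each weight vector to unit length.
    W_hat = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)
    # Plain linear response with the unit-norm weights.
    lin = x @ W_hat.T
    # Cosine similarity between each input and each weight vector.
    cos = lin / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    # Down-weight responses whose input is poorly aligned with the weight.
    return np.abs(cos) ** (B - 1.0) * lin

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # toy batch of 4 inputs
W = rng.normal(size=(3, 8))   # toy layer with 3 output units
W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)

out_b1 = bcos_linear(x, W, B=1.0)  # B = 1: no alignment scaling
out_b2 = bcos_linear(x, W, B=2.0)  # B = 2: scaled by |cos|
```

Under this formulation, B = 1 reduces to an ordinary bias-free linear layer with unit-norm weights, which illustrates the kind of architectural similarity between standard and B-cos layers that the abstract mentions. Larger B suppresses outputs for inputs that are poorly aligned with the weight vector.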

B-cos Alignment for Inherently Interpretable CNNs and Vision Transformers

B-Cos Aligned Transformers Learn Human-Interpretable Features

Deeper Interpretability of Deep Networks

Bort: Towards Explainable Neural Networks with Bounded Orthogonal Constraint

This actually looks like that: Proto-BagNets for local and global interpretability-by-design

Hybrid CNN-Interpreter: Interpret local and global contexts for CNN-based Models

Transparent Projection Networks for Interpretable Image Recognition

PICNN: A Pathway towards Interpretable Convolutional Neural Networks

This Looks Like That: Deep Learning for Interpretable Image Recognition

Label-Free Concept Bottleneck Models

Interpretable Deep Convolutional Neural Networks via Meta-learning

Interpreting the decisions of CNNs via influence functions

InterpretCC: Intrinsic User-Centric Interpretability through Global Mixture of Experts

Interpretable Network Visualizations: A Human-in-the-Loop Approach for Post-hoc Explainability of CNN-based Image Classification

Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability

An inherently interpretable deep learning model for local explanations using visual concepts

AdaCBM: An Adaptive Concept Bottleneck Model for Explainable and Accurate Diagnosis

Learning Bottleneck Concepts in Image Classification

Interpretability of deep learning models: A survey of results