The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability

Stephen Casper, Jieun Yun, Joonhyuk Baek, Yeseong Jung, Minhwan Kim, Kiwan Kwon, Saerom Park, Hayden Moore, David Shriver, Marissa Connor, Keltin Grimes, Angus Nicolson, Arush Tagade, Jessica Rumbelow, Hieu Minh Nguyen, Dylan Hadfield-Menell
2024-04-04
Abstract: Interpretability techniques are valuable for helping humans understand and oversee AI systems. The SaTML 2024 CNN Interpretability Competition solicited novel methods for studying convolutional neural networks (CNNs) at the ImageNet scale. The objective of the competition was to help human crowd-workers identify trojans in CNNs. This report showcases the methods and results of four featured competition entries. It remains challenging to help humans reliably diagnose trojans via interpretability tools. However, the competition's entries have contributed new techniques and set a new record on the benchmark from Casper et al., 2023.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to improve the interpretability of convolutional neural networks (CNNs) at the ImageNet scale, in particular to help humans identify trojans in CNNs. To that end, the authors held the SaTML 2024 CNN Interpretability Competition, which solicited new methods for helping human crowd-workers identify trojans in CNNs.

### Main Problems

1. **Improving the Interpretability of CNNs**:
   - Help humans better understand how CNNs work, especially how they behave on novel inputs.
   - Use interpretability tools to help humans identify and diagnose trojans in CNNs.
2. **Identifying Trojans**:
   - A trojan is a specific vulnerability that causes a CNN to produce unexpected outputs when certain trigger features appear.
   - The competition aims to develop new methods that help human crowd-workers reliably identify these trojans.

### Background

- **Deployment of AI Systems in High-Risk Environments**: Deploying AI systems in high-risk environments requires effective tools to ensure their trustworthiness.
- **Advantages of Interpretability Tools**: Unlike test sets, interpretability tools can sometimes allow humans to describe the behavior of the network on new examples.
- **Shortcomings of Existing Research**: Although interpretability tools can support better oversight, human understanding is difficult to measure, which makes clear progress hard to demonstrate.

### Competition Details

- **Benchmark**: The competition is based on a benchmark introduced by Casper et al. (2023), which aims to help crowd-workers discover trojans with human-interpretable triggers.
- **Types of Trojans**: The competition uses 12 trojans, divided into three categories: Patch, Style, and Natural Feature.
- **Evaluation Method**: Contestants need to generate 10 visualizations or text descriptions to help human crowd-workers identify 12 non-secret trojans.

### Methods

- **Prototype Generation (PG)**: Generates input images that maximize the activation of specific neurons, using feature synthesis together with regularization and diversity objectives.
- **Text-based Concept Activation Vectors (TextCAVs)**: A text-based interpretability method that builds concept vectors from CLIP embeddings and evaluates the model's sensitivity to specific concepts.
- **Feature Embedding Using Diffusion (FEUD)**: Combines reverse-engineering defenses with generative AI to produce interpretable representations of CNN trojans.
- **Fine-Tuning Robust Feature-Level Adversarial Generators (RFLA-Gen2)**: Trains an image generator to produce patches that are likely to cause the model to misclassify, thereby visualizing trojans.

### Discussion

- **Diversity of Methods**: Each participating method has its own advantages, and no single method dominates the others.
- **Success in Identifying Trojans**: Yun et al. and Nicolson successfully identified all four secret trojans, indicating that modern interpretability methods have made significant progress in some respects.
- **Future Directions**: Future research can apply similar methods to other types of networks (such as language models) and test them on practical problems.

In conclusion, the paper advances research on CNN interpretability by running a competition, and it contributes new methods and techniques for identifying and diagnosing trojans.
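
To make the idea behind the Prototype Generation entry in the Methods list more concrete, the following is a minimal sketch of feature synthesis by gradient ascent with an L2 regularizer. It is not the entrants' implementation: the "neuron" here is a hypothetical toy function (`tanh` of a weighted sum), the names `neuron_activation` and `synthesize_prototype` and all hyperparameters are assumptions made for illustration, and the diversity objective mentioned in the summary is omitted.

```python
import math
import random


def neuron_activation(x, w):
    """Toy stand-in for a CNN neuron: tanh of a weighted sum of the inputs."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))


def synthesize_prototype(w, steps=200, lr=0.1, l2=0.01, seed=0):
    """Gradient ascent on the input to maximize activation minus an L2 penalty."""
    rng = random.Random(seed)
    # Start from a small random input, as feature synthesis usually starts from noise.
    x = [rng.uniform(-0.01, 0.01) for _ in w]
    for _ in range(steps):
        a = neuron_activation(x, w)
        # Gradient of tanh(w . x) - l2 * ||x||^2 with respect to each x_i.
        grad = [(1.0 - a * a) * wi - 2.0 * l2 * xi for wi, xi in zip(w, x)]
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
    return x


# Hypothetical weights of the toy neuron.
weights = [0.5, -1.0, 2.0, 0.0]
prototype = synthesize_prototype(weights)
baseline = neuron_activation([0.0] * len(weights), weights)
optimized = neuron_activation(prototype, weights)
print(f"activation at zero input: {baseline:.3f}, at prototype: {optimized:.3f}")
```

In this toy setting the synthesized input aligns with the neuron's weights, while the L2 term keeps its magnitude bounded. For a real CNN the gradient would come from automatic differentiation rather than a hand-derived formula, and the regularizers would typically be chosen to keep the synthesized image natural-looking.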