Abstract: Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make them more interpretable and ultimately safe. Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to. We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron's behaviour from the dataset it was trained on and translates it into an interpretable graph. N2G uses truncation and saliency methods to emphasise only the most pertinent tokens to a neuron while enriching dataset examples with diverse samples to better encompass the full spectrum of neuron behaviour. These graphs can be visualised to aid researchers' manual interpretation, and can generate token activations on text for automatic validation by comparison with the neuron's ground truth activations, which we use to show that the graph model is better at predicting neuron activation than two baseline methods. We also demonstrate how the generated graph representations can be flexibly used to facilitate further automation of interpretability research, by searching for neurons with particular properties, or programmatically comparing neurons to each other to identify similar neurons. Our method easily scales to build graph representations for all neurons in a 6-layer Transformer model using a single Tesla T4 GPU, allowing for wide usability. We release the code and instructions for use at https://github.com/alexjfoote/Neuron2Graph.
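The truncation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the code from the N2G repository: the function `truncate_context`, the `keep_fraction` threshold, and the `toy_neuron` stand-in for a real neuron are all hypothetical names and assumptions chosen for illustration. The idea shown is to find the shortest context ending at the most strongly activating token that still preserves a chosen fraction of the original activation.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: maps a token sequence to one activation per token.
ActivationFn = Callable[[Sequence[str]], List[float]]


def truncate_context(
    tokens: Sequence[str],
    activation_fn: ActivationFn,
    keep_fraction: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[str]:
    """Return the shortest suffix ending at the peak token that keeps
    at least keep_fraction of the peak activation."""
    if not tokens:
        return []
    activations = activation_fn(tokens)
    # Index of the most strongly activating token in the full example.
    peak = max(range(len(tokens)), key=lambda i: activations[i])
    target = activations[peak] * keep_fraction
    prefix = list(tokens[: peak + 1])
    # Grow the context leftwards one token at a time until the
    # activation on the final (peak) token is sufficiently preserved.
    for start in range(peak, -1, -1):
        candidate = prefix[start:]
        if activation_fn(candidate)[-1] >= target:
            return candidate
    return prefix


def toy_neuron(tokens: Sequence[str]) -> List[float]:
    """Toy stand-in for a neuron: fires on 'graph' when preceded by 'to'."""
    return [
        1.0 if tok == "graph" and i > 0 and tokens[i - 1] == "to" else 0.0
        for i, tok in enumerate(tokens)
    ]


example = ["we", "map", "neuron", "to", "graph", "today"]
pruned = truncate_context(example, toy_neuron)
print(pruned)
```

With the toy neuron, the single token "graph" alone does not activate, so the sketch extends the context by one token and stops at "to graph". In practice the activation function would wrap a forward pass through the language model, which is outside the scope of this sketch.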

NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals

NxTF: An API and Compiler for Deep Spiking Neural Networks on Intel Loihi

NeuralVis: Visualizing and Interpreting Deep Learning Models

Joint Architecture Design and Workload Partitioning for DNN Inference on Industrial IoT Clusters

Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning

Toward Collaborative Inferencing of Deep Neural Networks on Internet-of-Things Devices

NeuronFair

TBD: Benchmarking and Analyzing Deep Neural Network Training

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models

Toward Scalable and Privacy-preserving Deep Neural Network Via Algorithmic-Cryptographic Co-design

iNNspector: Visual, Interactive Deep Model Debugging

PrivyNet: A Flexible Framework for Privacy-Preserving Deep Neural Network Training

Nets4Learning: A Web Platform for Designing and Testing ANN/DNN Models

Isolation and Induction: Training Robust Deep Neural Networks against Model Stealing Attacks

Acorns: A Framework for Accelerating Deep Neural Networks with Input Sparsity

LightRidge: An End-to-end Agile Design Framework for Diffractive Optical Neural Networks

NAS-LID: Efficient Neural Architecture Search with Local Intrinsic Dimension

Towards Scalable and Privacy-Preserving Deep Neural Network via Algorithmic-Cryptographic Co-design

Neuron to Graph: Interpreting Language Model Neurons at Scale

Making Sense of CNNs: Interpreting Deep Representations & Their Invariances with INNs

An In-Situ Visual Analytics Framework for Deep Neural Networks