Abstract: Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make them more interpretable and ultimately safe. Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to. We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron's behaviour from the dataset it was trained on and translates it into an interpretable graph. N2G uses truncation and saliency methods to emphasise only the most pertinent tokens to a neuron while enriching dataset examples with diverse samples to better encompass the full spectrum of neuron behaviour. These graphs can be visualised to aid researchers' manual interpretation, and can generate token activations on text for automatic validation by comparison with the neuron's ground truth activations, which we use to show that the model is better at predicting neuron activation than two baseline methods. We also demonstrate how the generated graph representations can be flexibly used to facilitate further automation of interpretability research, by searching for neurons with particular properties, or programmatically comparing neurons to each other to identify similar neurons. Our method easily scales to build graph representations for all neurons in a 6-layer Transformer model using a single Tesla T4 GPU, allowing for wide usability. We release the code and instructions for use at https://github.com/alexjfoote/Neuron2Graph.
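The truncation and saliency steps mentioned in the abstract can be illustrated with a small sketch. This is not the released N2G code: the function names `truncate` and `saliency`, the `keep_fraction` threshold, the `<pad>` replacement token, and the toy neuron are all assumptions made for illustration. The sketch assumes a neuron's activation is read on the final token of a context.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: maps a token list to the neuron's activation
# on the final token. A real setup would query the language model.
Activation = Callable[[List[str]], float]


def truncate(tokens: List[str], activation: Activation,
             keep_fraction: float = 0.5) -> List[str]:
    """Return the shortest suffix that keeps at least `keep_fraction`
    of the activation measured on the full context (assumed criterion)."""
    full = activation(tokens)
    for start in range(len(tokens) - 1, -1, -1):
        suffix = tokens[start:]
        if activation(suffix) >= keep_fraction * full:
            return suffix
    return tokens


def saliency(tokens: List[str], activation: Activation,
             pad: str = "<pad>") -> List[Tuple[str, float]]:
    """Score each token by the activation drop when it is replaced
    with a padding token (an assumed, simple perturbation scheme)."""
    full = activation(tokens)
    scores = []
    for i, tok in enumerate(tokens):
        masked = tokens[:i] + [pad] + tokens[i + 1:]
        scores.append((tok, full - activation(masked)))
    return scores


def toy_activation(tokens: List[str]) -> float:
    """Toy neuron (invented): fires strongly on 'York' after 'New',
    weakly on 'York' alone, and not at all otherwise."""
    if len(tokens) >= 2 and tokens[-2:] == ["New", "York"]:
        return 1.0
    if tokens and tokens[-1] == "York":
        return 0.25
    return 0.0


context = ["I", "live", "in", "New", "York"]
pruned = truncate(context, toy_activation)
token_scores = saliency(pruned, toy_activation)
```

In this toy case the pruned context is the two tokens the neuron depends on, and both receive a positive saliency score. A graph of such pruned, scored examples is the kind of structure the abstract describes, though how N2G actually builds it is not specified here.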

The Mysterious Case of Neuron 1512: Injectable Realignment Architectures Reveal Internal Characteristics of Meta's Llama 2 Model

Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network

Neuron to Graph: Interpreting Language Model Neurons at Scale

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models

Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models

ABNIRML: Analyzing the Behavior of Neural IR Models

MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model

Decoding In-Context Learning: Neuroscience-inspired Analysis of Representations in Large Language Models

Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability

NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment

Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language

Do Large Language Models Mirror Cognitive Language Processing?

Brain-like Functional Organization within Large Language Models

Too Big to Fail: Larger Language Models are Disproportionately Resilient to Induction of Dementia-Related Linguistic Anomalies

RecExplainer: Aligning Large Language Models for Explaining Recommendation Models

Examining the Role of Relationship Alignment in Large Language Models

Unlocking Emergent Modularity in Large Language Models

Unraveling Babel: Exploring Multilingual Activation Patterns of LLMs and Their Applications

The Llama 3 Herd of Models

Sharing Matters: Analysing Neurons Across Languages and Tasks in LLMs

Aligners: Decoupling LLMs and Alignment