Abstract: Explainable recommendation systems are important for enhancing transparency, accuracy, and fairness. Beyond result-level explanations, model-level interpretations can provide valuable insights that allow developers to optimize system designs and implement targeted improvements. However, most current approaches depend on specialized model designs, which often lack generalization capabilities. Given the variety of recommendation models, existing methods have limited ability to interpret them effectively. To address this issue, we propose RecSAE, an automatic, generalizable probing method for interpreting the internal states of Recommendation models with Sparse AutoEncoder. RecSAE serves as a plug-in module that does not affect original models during interpretation, while also enabling predictable modifications to their behaviors based on interpretation results. Firstly, we train an autoencoder with sparsity constraints to reconstruct internal activations of recommendation models, making the RecSAE latents more interpretable and monosemantic than the original neuron activations. Secondly, we automate the construction of concept dictionaries based on the relationship between latent activations and input item sequences. Thirdly, RecSAE validates these interpretations by predicting latent activations on new item sequences using the concept dictionary and deriving interpretation confidence scores from precision and recall. We demonstrate RecSAE's effectiveness on two datasets, identifying hundreds of highly interpretable concepts from pure ID-based models. Latent ablation studies further confirm that manipulating latent concepts produces corresponding changes in model output behavior, underscoring RecSAE's utility for both understanding and targeted tuning of recommendation models. Code and data are publicly available at https://github.com/Alice1998/RecSAE.
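
The validation step described in the abstract (predict latent activations on new item sequences, then score the interpretation by precision and recall) can be illustrated with a minimal sketch. The abstract does not say how precision and recall are combined into a confidence score, so the harmonic mean (F1) used here is an assumption, and the function names and toy data are hypothetical rather than taken from the RecSAE code.

```python
from typing import Sequence


def precision_recall(predicted: Sequence[bool], actual: Sequence[bool]) -> tuple[float, float]:
    """Compare predicted latent activations against observed ones."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(p and a for p, a in zip(predicted, actual))      # predicted active, was active
    fp = sum(p and not a for p, a in zip(predicted, actual))  # predicted active, was inactive
    fn = sum(a and not p for p, a in zip(predicted, actual))  # predicted inactive, was active
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def confidence_score(precision: float, recall: float) -> float:
    """Harmonic mean (F1). ASSUMPTION: the combination rule is not given in the abstract."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical held-out item sequences for one latent:
# predicted[i] = the concept dictionary says the latent should fire on sequence i
# actual[i]    = the latent actually fired on sequence i
predicted = [True, True, False, True, False, False]
actual = [True, False, False, True, True, False]

p, r = precision_recall(predicted, actual)
score = confidence_score(p, r)
print(f"precision={p:.2f} recall={r:.2f} confidence={score:.2f}")
```

A latent whose concept description predicts its activations well on unseen sequences scores close to 1, while a description that over-predicts or misses activations is penalized through precision or recall respectively.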

SPINE: SParse Interpretable Neural Embeddings

Word Equations: Inherently Interpretable Sparse Word Embeddings through Sparse Coding

Interpretable Neural Embeddings with Sparse Self-Representation

Disentangling Dense Embeddings with Sparse Autoencoders

xSense: Learning Sense-Separated Sparse Representations and Textual Definitions for Explainable Word Sense Networks

An Exploration Of Semantic Relations In Neural Word Embeddings Using Extrinsic Knowledge

Lightweight Adaptation of Neural Language Models via Subspace Embedding

Learning Sparse Overcomplete Word Vectors Without Intermediate Dense Representations

SPARLING: Learning Latent Representations with Extremely Sparse Activations

Sparse word embeddings using l1 regularized online learning

The Interpretable Dictionary in Sparse Coding

SPINE: Soft Piecewise Interpretable Neural Equations

Sparse Overcomplete Word Vector Representations

Tackling Polysemanticity with Neuron Embeddings

Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders

Interpret the Internal States of Recommendation Model with Sparse Autoencoder

Biologically Plausible Sparse Temporal Word Representations

Local vs distributed representations: What is the right basis for interpretability?

XANE: eXplainable Acoustic Neural Embeddings

Neural Generators of Sparse Local Linear Models for Achieving both Accuracy and Interpretability

SPINE: Structural Identity Preserved Inductive Network Embedding