Mechanistic Interpretability of Binary and Ternary Transformers

Jason Li
2024-05-28
Abstract: Recent research (<a class="link-https" data-arxiv-id="2310.11453" href="https://arxiv.org/abs/2310.11453">arXiv:2310.11453</a>, <a class="link-https" data-arxiv-id="2402.17764" href="https://arxiv.org/abs/2402.17764">arXiv:2402.17764</a>) has proposed binary and ternary transformer networks as a way to significantly reduce memory use and improve inference speed in Large Language Models (LLMs) while maintaining accuracy. In this work, we apply techniques from mechanistic interpretability to investigate whether such networks learn distinctly different or similar algorithms when compared to full-precision transformer networks. In particular, we reverse engineer the algorithms learned for the toy problem of modular addition, where we find that binary and ternary networks learn algorithms similar to those of full-precision networks. This provides evidence against the possibility of using binary and ternary networks as a more interpretable alternative in the LLM setting.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
This paper asks whether binary and ternary transformers learn algorithms that are significantly different from, or more interpretable than, those learned by full-precision transformers. Using techniques from mechanistic interpretability, it reverse engineers the algorithms these networks learn for the toy problem of modular addition. Comparing binary and ternary transformers with full-precision transformers, the paper finds that they learn similar algorithms, which suggests that binary and ternary networks are unlikely to be a more interpretable alternative in large language models (LLMs). The experiments also show that binary and ternary networks exhibit "grokking" behavior under weight decay regularization, and that their learned algorithms show periodic features similar to those of full-precision models, though with some differences in noise. Overall, the results provide evidence against the possibility of using binary and ternary networks to learn simpler, more interpretable algorithms.
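To make the idea of a "periodic feature" concrete, the following is a minimal sketch, not the paper's analysis code. It builds the modular addition task, creates a synthetic cosine feature over the input tokens, ternarizes it, and checks that the dominant Fourier frequency survives. The modulus `p = 17`, the frequency `k = 3`, all function names, and the absmean-style rounding (in the spirit of arXiv:2402.17764) are assumptions chosen for illustration.

```python
import cmath
import math

def modular_addition_dataset(p):
    """Every (a, b, (a + b) mod p) triple for a modulus p."""
    return [(a, b, (a + b) % p) for a in range(p) for b in range(p)]

def ternarize(values, eps=1e-8):
    """Assumed absmean-style quantizer: scale by mean |x|, round, clip to {-1, 0, 1}."""
    scale = sum(abs(v) for v in values) / len(values)
    return [max(-1, min(1, round(v / (scale + eps)))) for v in values]

def fourier_power(signal):
    """Power at each DFT frequency 0..n//2 of a real-valued signal."""
    n = len(signal)
    power = []
    for freq in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * freq * t / n)
                    for t, x in enumerate(signal))
        power.append(abs(coeff) ** 2 / n)
    return power

def dominant_frequency(signal):
    """Nonzero frequency with the most power (frequency 0 is the constant offset)."""
    power = fourier_power(signal)
    return max(range(1, len(power)), key=lambda freq: power[freq])

p, k = 17, 3  # hypothetical modulus and feature frequency
dataset = modular_addition_dataset(p)

# Synthetic full-precision feature: one cosine over the input token a.
feature = [math.cos(2 * math.pi * k * a / p) for a in range(p)]
ternary_feature = ternarize(feature)

full_precision_peak = dominant_frequency(feature)
ternary_peak = dominant_frequency(ternary_feature)
```

In this toy setting both `full_precision_peak` and `ternary_peak` equal `k`: rounding to three levels adds power at other frequencies, which acts like noise, but the original frequency remains dominant. This only illustrates the kind of check involved; it says nothing about what a trained network actually learns.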