Neurons to words: a novel method for automated neural network interpretability and alignment

Puglisi, L.S.; Valdés, F.; Metzger, J.J.

Item Type:	Conference or Workshop Item
Title:	Neurons to words: a novel method for automated neural network interpretability and alignment
Creators Name:	Puglisi, L.S., Valdés, F. and Metzger, J.J.
Abstract:	Recent years have witnessed an increase in the parameter size of frontier AI models by multiple orders of magnitude. This trend is driven by empirical observations, known as scaling laws, which show that model performance scales with model size, dataset size, and computational power. Motivated by this, researchers are training ever-larger models in pursuit of unlocking new capabilities. However, the growing complexity of these models makes understanding their inner workings increasingly challenging. Interpretability is crucial not only in fields like medicine and biotechnology, where understanding the internals of these models could lead to new insights, but also in superalignment, where the goal is to ensure that AI is aligned and acts according to human values and interests. We present a generic, scalable, first-of-its-kind method for automatically interpreting neural networks. In a proof-of-concept study, we establish the viability of converting neural network activations (here, for the first layer of a Convolutional Neural Network) into human-readable language. Additionally, we propose modifications to scale this method for understanding neural networks of any size. In anticipation of more capable large language models, this method could enable the monitoring of their internal mechanisms and decisions.
Source:	Proceedings of the AAAI Conference on Artificial Intelligence
Title of Book:	Thirty-Ninth AAAI Conference on Artificial Intelligence - Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence - Fifteenth Symposium on Educational Advances in Artificial Intelligence
ISSN:	2159-5399
ISBN:	978-1-57735-897-8
Publisher:	Association for the Advancement of Artificial Intelligence
Volume:	39
Number:	26
Page Range:	27591-27598
Number of Pages:	8
Date:	11 April 2025
Official Publication:	https://doi.org/10.1609/aaai.v39i26.34972
