Item Type: | Conference or Workshop Item |
---|---|
Title: | Neurons to words: a novel method for automated neural network interpretability and alignment |
Creators Name: | Puglisi, L.S., Valdés, F. and Metzger, J.J. |
Abstract: | Recent years have witnessed an increase in the parameter size of frontier AI models by multiple orders of magnitude. This trend is driven by empirical observations, known as scaling laws, which show that model performance scales with model size, dataset size, and computational power. Motivated by this, researchers are training ever-larger models in pursuit of unlocking new capabilities. However, the growing complexity of these models makes understanding their inner workings increasingly challenging. Interpretability is crucial not only in fields like medicine and biotechnology, where understanding the internals of these models could lead to new insights, but also in super alignment, where it is the goal to ensure that AI is aligned and acts according to human values and interests. We present a generic, scalable first-of-its-kind method for automatically interpreting neural networks. In a proof-of-concept study we establish the viability of converting neural network activations (here, for the first layer of a Convolutional Neural Network) into human-readable language. Additionally, we propose modifications to scale this method for understanding neural networks of any size. In anticipation of more capable large language models, this method could enable the monitoring of their internal mechanisms and decisions. |
Source: | Proceedings of the AAAI Conference on Artificial Intelligence |
Title of Book: | Thirty-Ninth AAAI Conference on Artificial Intelligence - Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence - Fifteenth Symposium on Educational Advances in Artificial Intelligence |
ISSN: | 2159-5399 |
ISBN: | 978-1-57735-897-8 |
Publisher: | Association for the Advancement of Artificial Intelligence |
Volume: | 39 |
Number: | 26 |
Page Range: | 27591-27598 |
Number of Pages: | 8 |
Date: | 11 April 2025 |
Official Publication: | https://doi.org/10.1609/aaai.v39i26.34972 |