David Bau

Interpretable Neural Networks

Northeastern University Khoury College of Computer Sciences

A brief interview with David: why we study deep network internals.

I am an Assistant Professor of Computer Science at Northeastern Khoury College. My lab studies the structure and interpretation of deep networks.

We think that understanding the rich internal structure of deep networks is a grand and fundamental research question with many practical implications.

We aim to lay the groundwork for human-AI collaborative software engineering, where humans and machine-learned models both teach and learn from each other.

Want to come to Boston to work on deep learning with me? Apply to Khoury here and contact me if you are interested in joining as a graduate student or postdoc. Also check out NDIF engineering fellowships.

Curriculum Vitae. (PhD MIT EECS, thesis; Cornell; Harvard; Google; Microsoft.)
Publication pages on Dblp and Google Scholar.

In the News

How does ChatGPT think? "Researchers are striving to reverse-engineer artificial intelligence and scan the 'brains' of LLMs to see what they are doing, how and why... Researchers want explanations so that they can create safer, more efficient and more accurate AI. Users want explanations so that they know when to trust a chatbot's output. And regulators want explanations so that they know what AI guard rails to put in place.... Bau and his colleagues have also developed methods to scan and edit AI neural networks, including a technique they call causal tracing...." This news article in Nature surveys the emerging field of deep network interpretation. The increasing complexity of generative AI systems such as ChatGPT has spawned a new research subfield that cracks open these large models and interprets their emergent internal structure. The article surveys perspectives on this new research area from several relevant researchers including David Bau, Mor Geva, Martin Wattenberg, Chris Olah, Thilo Hagendorff, Sam Bowman, Sandra Wachter, Andy Zou, and Peter Hase. The article also highlights research from Kenneth Li, Jason Wei, Miles Turpin, Kevin Meng and Roger Grosse. Nature news feature by Matthew Hutson, May 14, 2024.

We have launched the National Deep Inference Fabric (NDIF) project. Large-scale AI presents fundamental open scientific questions and major societal impacts that are not yet well understood, and they are both difficult and expensive to study. NDIF is a major investment in scientific infrastructure to help meet the challenge, with $9m of funding from the National Science Foundation to develop large-scale AI inference software aimed at enabling cutting-edge research on questions in the public interest, such as "how can we explain an AI decision?" or "what can improve the safety and robustness of AI?" The goal of NDIF is to provide a robust and transparent AI inference service to enable scientists in every part of the country and in every field touched by AI to expand, accelerate, and democratize impactful AI science. (Programmers interested in the technical details can pip install nnsight to try NDIF today.) Users can remotely access and alter activations and gradients, and customize any step of large models like llama3-70b, much like having a 70b model on your own laptop. This transparency goes far beyond commercial AI inference services, and NSF funding will expand NDIF to support every open model and a broad range of research methods. Read about the positions NDIF is hiring to fill on LinkedIn, Twitter/X, and on the NDIF website. NSF funds groundbreaking research led by Northeastern to democratize artificial intelligence. Northeastern Global News, May 2, 2024.

Selected Projects

Linearity of Relation Decoding in Transformer LMs. What is the right level of abstraction to use when understanding a huge network? While it is natural to examine individual neurons, attention heads, modules, and representation vectors, we should also ask whether taking a holistic view of a larger part of the network can reveal any higher-level structure. In this work, we ask how relationships between entities and their attributes are represented, and we measure the power of the Jacobian (the matrix derivative) to capture the action of a range of transformer layers in applying a relation to an entity. When a representation vector passes through a range of transformer layers, it is subjected to a very nonlinear transformation. Yet in this paper we find that when the network resolves a specific relationship such as person X plays instrument Y, the action of the transformer from the vector for X to the vector for Y will often be essentially linear, suggesting that the information about Y is already present in X. Moreover, the linear operator can be extracted by examining the Jacobian using as few as a single example of the relation. We analyze more than 40 different relations to determine which have a linear representation, and we introduce a tool, the attribute lens, that exploits linearity to visualize the relational information carried in a state vector. E Hernandez, A Sen Sharma, T Haklay, K Meng, M Wattenberg, J Andreas, Y Belinkov, D Bau. Linearity of Relation Decoding in Transformer Language Models. ICLR 2024 (spotlight).
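The first-order idea can be illustrated with a toy example. This is only a sketch, not the paper's implementation: the function toy_layers is a hypothetical stand-in for a range of transformer layers, the dimensions are arbitrary, and the Jacobian is estimated by finite differences rather than by automatic differentiation on a real model. It shows how a single example s0 yields an affine map J s + b that matches the nonlinear map near s0.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
A = rng.normal(size=(d, d)) / np.sqrt(d)
B = rng.normal(size=(d, d)) / np.sqrt(d)

def toy_layers(s):
    """Hypothetical nonlinear stand-in for a range of transformer layers."""
    return B @ np.tanh(A @ s) + s

def jacobian(f, s0, eps=1e-5):
    """Central finite-difference estimate of the Jacobian of f at s0."""
    n = s0.shape[0]
    J = np.zeros((f(s0).shape[0], n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        J[:, i] = (f(s0 + step) - f(s0 - step)) / (2 * eps)
    return J

# Build the affine approximation from a single example s0.
s0 = rng.normal(size=d)
J = jacobian(toy_layers, s0)
b = toy_layers(s0) - J @ s0

def linear_approx(s):
    """Affine surrogate for toy_layers, exact at s0 and first-order nearby."""
    return J @ s + b

# Compare the surrogate with the true map at a nearby point.
s_near = s0 + 1e-3 * rng.normal(size=d)
err = np.linalg.norm(linear_approx(s_near) - toy_layers(s_near))
```

Whether such an approximation stays accurate far from s0 is an empirical question about the model, which is what the paper measures across relations.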

Fine-Tuning Enhances Existing Mechanisms. When you fine-tune an LLM, are you teaching it something new or exposing what it already knows? In this work, we pin down the detailed structure of the mechanisms for an entity-tracking task using new patching techniques, revealing a pre-existing circuit when a capability emerges from fine-tuning. The paper applies path-patching causal mediation methods as used in Wang 2022 (IOI) to identify the components for a circuit for entity tracking that emerges after fine-tuning. Interestingly, we find that the components already existed in the model prior to fine-tuning. Furthermore we use our DCM patching method to deduce the type of information being transmitted at most of the steps before and after fine-tuning, and find that the role of the information is unchanged under fine-tuning. Finally, we introduce Cross-Model Activation Patching (CMAP) to test whether the encoding of information is changed after fine-tuning, and we find that the encodings are compatible, not only allowing interchange, but also revealing that improved task performance can be obtained by directly patching model activations between models. N Prakash, T R Shaham, T Haklay, Y Belinkov, D Bau. Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking. ICLR 2024.

Function Vectors in Large Language Models. The idea of treating a function reference as data is one of the most powerful concepts in computer science, enabling complex computational forms. Do neural networks learn to represent functions as data? In this paper, we study in-context-learning inside large transformer language models and show evidence that vector representations of functions appear. Function vectors (FVs) emerge when a language model generalizes a list of demonstrations of input-output pairs (via in-context learning, ICL). To study how ICL works, we apply causal mediation analysis to identify attention heads that transport information that determines the task to execute. This analysis reveals a small number of attention heads that transport a vector which we call a function vector (FV), that generically encodes the task. We study the properties of FVs, finding that they can trigger execution of the function when injected into very different contexts including natural text. We find that FVs seem to directly encode the word embeddings of the output space, and that they also trigger nontrivial transformer calculations that differ from word-vector arithmetic. FVs are able to obey semantic vector algebra, but rather than operating on word embeddings, they enable compositions of function execution. E Todd, M L Li, A Sen Sharma, A Mueller, B C Wallace, D Bau. Function Vectors in Large Language Models. ICLR 2024.

Unified Concept Editing in Diffusion Models. Text-to-image diffusion models such as Stable Diffusion have many issues that limit their suitability for real-world deployment: they amplify racial and gender biases; they imitate copyrighted images; and they generate offensive content. We introduce a method, Unified Concept Editing, that allows precise editing of many concepts within a diffusion model, and we show that it can be used to reduce bias, copyright, and offensive content issues simultaneously. Our UCE method is a generalization and improvement upon the ROME, MEMIT, and TIME methods. It modifies the associations between textual concepts and visual concepts by directly editing the cross-attention parameters in the diffusion model without any additional training images. Its closed-form parameter modification explicitly applies an optimal change to sets of concepts while protecting other sets of concepts from unintended modification. The paper compares UCE to previous state-of-the-art erasure, debiasing, and offensive image removal methods and shows that our unified editing method outperforms previous separate approaches by a significant margin. R Gandikota, H Orad, Y Belinkov, J Materzyńska, D Bau. Unified Concept Editing in Diffusion Models. WACV 2024.
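As a rough illustration of what a closed-form edit of a linear projection can look like, the sketch below assumes a plain least-squares objective: edited concepts c_i should map to new targets v_i*, while preserved concepts c_j should keep their old outputs W_old c_j. The function name, variable names, and toy sizes are hypothetical, no regularization is included, and this is not the released UCE code.

```python
import numpy as np

def closed_form_edit(W_old, C_edit, V_star, C_keep):
    """Minimize sum_i ||W c_i - v_i*||^2 + sum_j ||W c_j - W_old c_j||^2.

    Columns of C_edit and C_keep are concept embeddings; columns of V_star
    are the desired outputs for the edited concepts.
    """
    lhs = V_star @ C_edit.T + W_old @ C_keep @ C_keep.T
    rhs = C_edit @ C_edit.T + C_keep @ C_keep.T
    # W = lhs @ inv(rhs); solve the transposed system instead of inverting.
    return np.linalg.solve(rhs.T, lhs.T).T

rng = np.random.default_rng(0)
d_in, d_out = 5, 3
W_old = rng.normal(size=(d_out, d_in))
C_edit = rng.normal(size=(d_in, 2))   # two concepts to change
C_keep = rng.normal(size=(d_in, 3))   # three concepts to protect
V_star = rng.normal(size=(d_out, 2))  # new targets for the edited concepts

W_new = closed_form_edit(W_old, C_edit, V_star, C_keep)
```

In this toy setting the five concept vectors span the input space, so both sets of constraints are met exactly; with fewer concepts than input dimensions the system would need a regularizer to be solvable.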

Future Lens. Autoregressive language models like GPT are trained to predict the next word. But we found they are also often thinking several further tokens ahead! In this work, we measure this future information, and we show how to extend the logit lens to reveal a run of future anticipated tokens from individual transformer hidden states. Our paper experiments with several ways to decode future tokens from a single hidden state. Inspired by "tuned lens" methods from Belrose and Yom Din that skip to future layers, we first try training a simple linear readout model. We also try transplanting the hidden state into the context of a prompt specially chosen to evoke future output. Using a tuned prompt reveals that two-ahead tokens can be predicted with more than 48% accuracy, which is good enough to be useful for "future lens" visualizations. K Pal, J Sun, A Yuan, B C Wallace, D Bau. Future Lens: Anticipating Subsequent Tokens from a Single Hidden State. CoNLL 2023.

Erasing Concepts from Diffusion Models. We propose a method for fine-tuning model weights to erase concepts from diffusion models using their own knowledge. Given just the text of the concept to be erased, our method can edit the model weights to erase the concept while minimizing the interference with other concepts. This type of fine-tuning has an advantage over previous methods: it is not easy to circumvent because it modifies weights, yet it is fast and practical because it avoids the expense of retraining the whole model on filtered training data. The ESD method erases a concept by using the original model's own knowledge of the concept as a guide while training a modified model. Rather than guiding the new model to imitate the original exactly, the new model is guided to imitate the original while reversing its behavior on the selected concept. This objective corresponds to an exact ratio of probability distributions, and is straightforward to compute, equivalent to an application of classifier-free guidance for training rather than inference. Our paper studies the ESD method as used to tune different sets of parameters, and as applied to a variety of concept-erasure applications including artistic style removal, offensive-image removal, and removing knowledge of object classes. R Gandikota, J Materzyńska, J Fiotto-Kaufman, D Bau. Erasing Concepts from Diffusion Models. ICCV 2023.
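The connection to classifier-free guidance can be sketched in a few lines. This assumes the frozen original model's unconditional and concept-conditioned noise predictions are already available as arrays; the function names and the strength parameter eta are illustrative, and the real training loop (sampling, timesteps, the loss on the fine-tuned model) is omitted.

```python
import numpy as np

def esd_target(eps_uncond, eps_cond, eta=1.0):
    """Training target that steers away from the concept direction.

    eps_uncond: frozen model's noise prediction without the concept prompt.
    eps_cond:   frozen model's noise prediction with the concept prompt.
    """
    return eps_uncond - eta * (eps_cond - eps_uncond)

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard inference-time guidance toward the conditioned prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_u = rng.normal(size=(4, 4))  # hypothetical unconditional prediction
eps_c = rng.normal(size=(4, 4))  # hypothetical conditioned prediction
target = esd_target(eps_u, eps_c, eta=1.0)
```

In this sketch the target is the same expression as guidance with a negative scale, used as a regression target for the fine-tuned model's conditioned prediction rather than as a sampling rule.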

Locating and Editing Factual Associations in GPT. In this project, we show that factual knowledge within GPT also corresponds to a localized computation that can be directly edited. For example, we can make a small change to a small set of the weights of GPT-J to teach it the counterfactual "Eiffel Tower is located in the city of Rome." Rather than merely regurgitating the new sentence, it will generalize that specific counterfactual knowledge and apply it in very different linguistic contexts. To show that factual knowledge within a GPT model corresponds to a simple, localized, and directly editable computation, we introduce three new concepts. (1) We introduce Causal Tracing, a method to locate decisive information within a network by corrupting and restoring hidden neural states; traces reveal how information about a fact is retrieved by MLP layers in the network. (2) We show how to apply rank-one matrix edits (ROME) to change individual memories within an MLP module within a transformer. (3) And we show how to distinguish between generalized factual knowledge and rote regurgitation of a fact, using a new data set called CounterFact. K Meng* and D Bau*, A Andonian, Y Belinkov. Locating and Editing Factual Associations in GPT. NeurIPS 2022.
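A rank-one edit of a linear layer viewed as a key-value memory can be sketched as follows. This is a toy illustration under stated assumptions: W is a random matrix rather than an MLP projection from GPT-J, C is an uncentered covariance estimated from hypothetical random keys, and k_star and v_star are random vectors standing in for the key of the subject and the value encoding the new fact. How those vectors are chosen in a real model is not shown.

```python
import numpy as np

def rank_one_edit(W, C, k_star, v_star):
    """Return W + lam (C^-1 k*)^T, chosen so the edited matrix maps k* to v*."""
    u = np.linalg.solve(C, k_star)              # C^{-1} k*
    lam = (v_star - W @ k_star) / (u @ k_star)  # residual scaled by k*^T C^{-1} k*
    return W + np.outer(lam, u)

rng = np.random.default_rng(0)
d_in, d_out = 6, 4
W = rng.normal(size=(d_out, d_in))     # toy stand-in for an MLP weight matrix
K = rng.normal(size=(d_in, 100))       # hypothetical sample of previously seen keys
C = K @ K.T / K.shape[1]               # uncentered key covariance estimate
k_star = rng.normal(size=d_in)         # key to rewrite
v_star = rng.normal(size=d_out)        # new value to store

W_new = rank_one_edit(W, C, k_star, v_star)
```

The update changes W by a single outer product, and weighting by the inverse key covariance is what keeps the change small for other typical keys.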

Rewriting a Deep Generative Model. Deep network training is a blind optimization procedure where programmers define objectives but not the solutions that emerge. In this paper we ask if deep networks can be created in a different way: we ask if a user can directly rewrite the rules of a model. We develop a method and a user interface that allows simple visual editing of the rules of a GAN, and demonstrate direct editing of high-level rules in pretrained state-of-the-art StyleGANv2 models. Our method finds connections between modern large network layers and the classic neural data structure of Optimal Linear Associative Memory, and shows that it is feasible for a person to directly edit the weights of a large model to change its behavior, without training against a data set, by understanding the model's internal structure. D Bau, S Liu, TZ Wang, JY Zhu, A Torralba. Rewriting a Deep Generative Model. ECCV 2020 oral.

Understanding the Role of Individual Units in a Deep Neural Network. The causal role of individual units within a deep network can be measured by directly changing those units and observing the impact. In this study, we unify and extend the netdissect and gandissect methods to compare and understand classifiers and generators. Removing sets of units from a classifier reveals a sparse computational structure: we find that a small set of neurons is important for the accuracy of an individual classifier output class, and that neurons that are important for more classes also are more human-interpretable. D Bau, JY Zhu, H Strobelt, A Lapedriza, B Zhou, A Torralba. Understanding the Role of Individual Units in a Deep Neural Network. Proceedings of the National Academy of Sciences. 2020.

Structure and Interpretation of Deep Networks. Most introductory courses on deep networks focus on how to train models, but it is just as important to understand the structure and behavior of the models after training is done. By bringing current research in deep network interpretation to students, the SIDN course is designed to start filling that gap. Organized with Yonatan Belinkov, Julius Adebayo and a group of a dozen brilliant speakers, our course covered salience methods, global model analysis, adversarial robustness, fairness, interactive methods, and natural language explanations. Each topic is anchored by a set of hands-on exercises in Colab notebooks that are posted online for students to work through and explore. Organizers D Bau, Y Belinkov, J Adebayo, H Strobelt, A Ross, V Petsiuk, S Gehrmann, M Suzgun, S Santurkar, D Tsipras, I Chen, J Mu, J Andreas. Structure and Interpretation of Deep Networks. 2020 IAP Course at MIT.

Seeing what a GAN Cannot Generate studies mode dropping by asking the inverse question: how can we decompose and understand what a GAN cannot do? A core challenge faced by GANs is mode dropping or mode collapse, which is the tendency for a GAN generator to focus on a few modes and omit other parts of the distribution. State-of-the-art GANs apply training methods designed to reduce mode collapse, but analyzing the phenomenon remains difficult for large distributions: examination of output samples reveals what a GAN can do, not what it cannot do. So in this paper we develop a pair of complementary methods for decomposing what a GAN omits, looking at segmentation statistics over a distribution, and also visualizing omissions in specific instances by calculating inversions of a GAN generator. Surprisingly, we find that a state-of-the-art GAN will sometimes cleanly omit whole classes of objects from its output, hiding these omissions by creating realistic instances without those objects. D Bau, JY Zhu, J Wulff, W Peebles, H Strobelt, B Zhou, A Torralba. Seeing What a GAN Cannot Generate. ICCV 2019 oral.

GAN Paint applies GAN dissection to the manipulation of user-provided real photographs. By encoding a scene into a representation that can be rendered by a generator network derived from a GAN, a user can manipulate photo semantics, painting objects such as doors, windows, trees, and domes. The details of rendering objects in plausible configurations are left to the network. Our previous GAN dissection work showed how to manipulate random synthetic images generated by an unconditional GAN. To manipulate a real photograph X instead, the generator must be guided to reproduce the photograph faithfully. While previous work has investigated finding the best input z so that G(z)≈X, we show that it is useful to also optimize the parameters of G itself. Even in cases where the GAN is not capable of rendering the details of the user-provided photo, a nearby GAN generator can be found that does. We implemented our algorithm using an interactive painting app at ganpaint.io. D Bau, H Strobelt, W Peebles, J Wulff, B Zhou, JY Zhu, A Torralba. Semantic Photo Manipulation with a Generative Image Prior. In SIGGRAPH 2019.

GAN Dissection investigates the internals of a GAN, and shows how neurons can be directly manipulated to change the behavior of a generator. Here we ask whether the apparent structure that we found in classifiers also appears in a setting with no supervision from labels. Strikingly, we find that a state-of-the-art GAN trained to generate complex scenes will learn neurons that are specific to types of objects in the scene, such as neurons for trees, doors, windows, and rooftops. The work shows how to find such neurons, and shows that by forcing the neurons on and off, you can cause a generator to draw or remove specific types of objects in a scene. D Bau, JY Zhu, H Strobelt, B Zhou, JB Tenenbaum, WT Freeman, A Torralba. GAN Dissection: Visualizing and Understanding Generative Adversarial Networks. In ICLR 2019.

Network Dissection is a technique for quantifying and automatically estimating the human interpretability (and interpretation) of units within any deep neural network for vision. Building upon a surprising 2014 finding by Bolei Zhou, network dissection defines a dictionary of 1197 human-labeled visual concepts, each represented as a segmentation problem, then it estimates interpretability by evaluating each hidden convolutional unit as a solution to those problems. I have used network dissection to reveal that representation space is not isotropic: learned representations have an unusually high agreement with human-labeled concepts that vanishes under a change in basis. We gave an oral presentation about the technique and the insights it provides at CVPR 2017. D Bau, B Zhou, A Khosla, A Oliva, and A Torralba. Network Dissection: Quantifying the Interpretability of Deep Visual Representations. CVPR 2017 oral.
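Scoring a unit as a solution to a segmentation problem can be sketched with an intersection-over-union measure between the thresholded activation map and a concept mask. The example below is a simplified illustration: the arrays are made up, the fixed threshold is an assumption (a real pipeline would derive it from activation statistics over a dataset), and the function name is hypothetical.

```python
import numpy as np

def unit_concept_iou(activation, concept_mask, threshold):
    """Intersection over union between a thresholded unit and a concept mask."""
    unit_mask = activation > threshold
    inter = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return inter / union if union > 0 else 0.0

# Hypothetical upsampled activation map of one convolutional unit.
activation = np.array([[0.1, 0.9, 0.8],
                       [0.0, 0.7, 0.2],
                       [0.0, 0.1, 0.0]])

# Hypothetical human-labeled segmentation mask for one visual concept.
concept_mask = np.array([[0, 1, 1],
                         [0, 1, 1],
                         [0, 0, 0]], dtype=bool)

score = unit_concept_iou(activation, concept_mask, threshold=0.5)
```

Here the unit fires on three of the four concept pixels and nowhere else, so the score is 0.75; a unit would be matched to whichever concept in the dictionary gives it the highest score.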

Blocks and Beyond is a workshop I helped organize to bring together researchers who are investigating block-based interfaces to simplify programming for novices and casual programmers. The workshop was oversubscribed, and the presented work was interesting both for its breadth and depth. Afterwards, we wrote a review paper to survey the history, foundations, and state-of-the-art in the field. The review appears in the June 2017 Communications of the ACM; also see the video overview. D Bau, J Gray, C Kelleher, J Sheldon, F Turbak. Learnable Programming: Blocks and Beyond. Communications of the ACM, pp. 72-80. June 2017.

Pencil Code is an open-source project that makes it easier for novice programmers to work with professional programming languages. Developed together with my son and with the generous support of Google, this system provides a blocks-based editing environment with turtle graphics on a canvas that smoothly transitions to text-based editing of web applications using jQuery. Two thousand students use the system each day. A study of middle-school students using the environment suggests the block-and-text transitions are an aid to learning. D Bau, D A Bau, M Dawson, C S Pickens. Pencil code: block code for a text world. In Proceedings of the 14th International Conference on Interaction Design and Children, pp. 445-448. ACM, 2015.

Google Image Search is the world's largest searchable index of images. I contributed several improvements to this product, including improved ranking for recent images, a clustered browsing interface for discovering images using related searches, a rollout of new serving infrastructure to support a long-scrolling result page serving one thousand image results at a time, and improvements in the understanding of person entities on the web. M Zhao, J Yagnik, H Adam, D Bau. Large scale learning and recognition of faces in web videos. In Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference on (pp. 1-7). IEEE, September 2008.

Google Talk is a web-based chat solution that was built into Gmail. I led the team to create Google Talk in an (ultimately unsuccessful) attempt to establish a universal federated open realtime communication ecosystem for the internet. Our messaging platform provided full-scale support for XMPP and Jingle, which are open standards for federating real-time chat and voice that are analogous to the open-for-all SMTP system for email. When these open protocols came under asymmetric attack by Microsoft (they provided only one-way compatibility), Google relented and reverted to a closed network. To this day, open realtime communications remains an unfulfilled dream for the internet. D Bau. Google Gets to Talking. Google Official Blog, August 2005.

Apache XML Beans is an open-source implementation of the XML Schema specification as a compiler from schema types to Java classes. Still used as a powerful document interchange technology, my team's implementation of this standard is a good example of an important approach that continues to be a key technique for the creation of understandably complex systems: the prioritization of faithful and transparent data representations over simplified but opaque functional encapsulations. D Bau. The Design of XML Beans, davidbau.com, a dabbler's weblog, November 2003.

Microsoft Internet Explorer 4 was the first AJAX web browser. As part of the Trident team led by Adam Bosworth, I helped create the first fully mutable HTML DOM by defining its asynchronous loading model. My contribution was to implement an incremental HTML parser that uses speculative lookahead to drive a fast multithreaded preloader for linked resources, while maintaining a consistent view of programmable elements for single-threaded scripts that can change the document during loading. The design of the system resolved tensions between performance, flexibility, and programmability, and contributed to the strength of the modern web platform.

Numerical Linear Algebra is the graduate textbook on numerical linear algebra I wrote with my advisor Nick Trefethen while earning a Master's at Cornell. The book began as a detailed set of notes that I took while attending Nick's course. The writing is intended to capture the spirit of his teaching: succinct and insightful. The hope is to reveal the elegance of this family of fundamental algorithms and dispel the myth that finite-precision arithmetic means imprecise thinking. L N Trefethen, D Bau. Numerical linear algebra. Vol. 50. SIAM, 1997.