Mapping the Mind of a Neural Net: Goodfire’s Eric Ho on the Future of Interpretability
Introduction & The Need for Interpretability 00:00
Goodfire is an AI interpretability research company focused on understanding what happens inside neural networks.
The company's mission is motivated by AI's increasing use in mission-critical applications, requiring more reliability, safety, and control than "black box" approaches allow.
The goal is to enable intentional design, editing, and debugging of AI models, opening up the "black box" for the first time.
Interpreting AI models allows for more certainty about their behavior and is compared to understanding drugs at a molecular level rather than just observing outcomes.
Eric Ho is optimistic about fully understanding neural networks, citing the advantage of complete access to all parameters, weights, and activations in AI compared to biological brains.
Partial understanding exists by reconstructing and extracting concepts from networks; the field aims to gradually improve this.
Full interpretability is seen as crucial for intentional AI design, drawing an analogy to how thermodynamics improved steam engine safety and design.
Insights from AI interpretability may help accelerate neuroscience by providing analogies and even parallels to human cognition.
Diverse thinking patterns (conceptual vs. linguistic) among humans are noted, possibly mirrored in neural network processing.
The concept of universality is discussed: similar patterns often emerge in different neural nets and even in comparison to biological systems (e.g., visual cortex similarities).
Goodfire envisions moving from passively "growing" neural networks to actively shaping and pruning them like bonsai trees—combining unsupervised growth with intentional interventions.
The approach aspires to map how training data shapes AI cognition and to enable both explanation ("why did you produce this output?") and direct modification to remove harmful or unwanted behaviors.
Common techniques like prompt engineering and fine-tuning are considered crude and can introduce unintended side effects, as highlighted by studies showing odd emergent behaviors when models are fine-tuned on problematic data.
Fine-tuning can strengthen not only the desired behaviors but also latent, undesirable connections, owing to the alien nature of neural network cognition.
Mechanistic Interpretability: Field Overview & Key Results 17:51
Mechanistic interpretability focuses on identifying features (concepts), circuits (combinations of features for higher-order concepts), and universality (similar circuits across models).
Notable progress includes the "Circuits" thread from OpenAI researchers and contributions from labs such as Anthropic and DeepMind.
A major breakthrough was addressing superposition, where single neurons represent multiple concepts; techniques such as sparse autoencoders decompose these mixed activations into "cleaner," monosemantic, interpretable concepts.
Superposition, Interpreter Models, and Scaling 20:04
Superposition is the phenomenon where neurons encode multiple concepts, a necessity due to limited dimensions relative to the number of concepts in large models.
Interpreter models are trained on activations from base models to isolate and clarify these overlapping concepts.
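A minimal sketch of what such an interpreter model can look like, assuming a sparse autoencoder trained on cached base-model activations; the layer sizes, sparsity penalty, and training details below are illustrative assumptions, not Goodfire's actual setup.

```python
# Sketch of a sparse autoencoder ("interpreter model") trained on activations
# from a base model. Sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Overcomplete dictionary: many more features than model dimensions,
        # so superposed concepts can each get their own direction.
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruction term keeps the dictionary faithful to the base model;
    # the L1 term pushes each activation to be explained by few features.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Usage: cache residual-stream activations from the base model, then train.
sae = SparseAutoencoder(d_model=768, d_features=16384)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(4096, 768)  # stand-in for cached base-model activations
features, recon = sae(acts)
loss = sae_loss(acts, features, recon)
opt.zero_grad()
loss.backward()
opt.step()
```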
Auto-interpretability leverages language models to label and explain concepts found, with model quality improving as understanding improves.
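A hedged sketch of that auto-interpretability loop: for each learned feature, gather the text snippets where it fires most strongly and ask a language model to propose a label. `label_with_llm` is a hypothetical stand-in for whatever LLM call is actually used.

```python
# Sketch of auto-interpretability: label each interpreter-model feature by
# showing a language model its top-activating examples.
import torch

def top_activating_examples(features: torch.Tensor, texts: list[str],
                            feature_idx: int, k: int = 10) -> list[str]:
    # features: [num_examples, num_features] activations from the interpreter model.
    scores = features[:, feature_idx]
    top = torch.topk(scores, k=min(k, len(texts))).indices
    return [texts[i] for i in top]

def label_feature(features, texts, feature_idx, label_with_llm):
    examples = top_activating_examples(features, texts, feature_idx)
    prompt = (
        "These text snippets all strongly activate the same internal feature "
        "of a language model. Give a short name for the concept they share:\n"
        + "\n".join(f"- {t}" for t in examples)
    )
    # `label_with_llm` is a hypothetical callable wrapping an LLM API.
    return label_with_llm(prompt)  # e.g. "references to legal contracts"
```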
These techniques are now scalable to very large models, enabling practical applications.
Goodfire has partnered with Arc Institute to interpret models like EVO 2, a DNA sequence prediction model, uncovering known and potentially novel biological patterns.
The hope is that these techniques can lead to new understandings in genomics, such as identifying functions for previously "junk" DNA.
Editing successes have mostly involved language and image models; for example, the "Paint with Ember" demo allows targeted image generation edits by manipulating latent concepts.
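One common way to do this kind of concept-level steering (a generic sketch, not the Ember or Paint with Ember API) is to push a hidden layer's activations along a feature's decoder direction during generation; the layer index and strength below are arbitrary assumptions.

```python
# Generic concept-steering sketch: shift a layer's activations along a
# concept direction while the model generates. Not Goodfire's actual API.
import torch

def make_steering_hook(direction: torch.Tensor, strength: float = 4.0):
    unit = direction / direction.norm()
    def hook(module, inputs, output):
        # Assumes the module returns a plain activation tensor; real model
        # blocks often return tuples and would need unpacking here.
        return output + strength * unit
    return hook

# Usage, assuming a PyTorch model with an indexable list of blocks:
# concept_vec = sae.decoder.weight[:, feature_idx]  # decoder direction for one feature
# handle = model.blocks[12].register_forward_hook(make_steering_hook(concept_vec))
# ...generate text or images...
# handle.remove()
```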
Direct, surgical model edits without side effects remain challenging.
Interpretability for Governance, Auditing, and Open Models 31:00
Interpretability enables model auditing, identifying and removing undesirable behaviors in mission-critical deployments.
"Model diffing"—comparing checkpoints to track changes—is an emerging application, helping detect shifts in model properties such as increased sycophancy.
Interpretability could become essential for understanding, modifying, and ensuring the trustworthiness of open-source or internationally developed models.
Interpretability is expected to be key in handling AI models from different jurisdictions and ensuring they align with desired properties.
Eric Ho predicts rapid AI progress, with interpretability likely required in legal and regulatory contexts as models impact society in unpredictable ways.
Looking ahead, Ho believes full interpretability and understanding of neural nets will be achieved by 2028, possibly requiring new conceptual paradigms beyond today’s frameworks.