Mapping the Mind of a Neural Net: Goodfire’s Eric Ho on the Future of Interpretability

Introduction & The Need for Interpretability 00:00

  • Goodfire is an AI interpretability research company focused on understanding what happens inside neural networks.
  • The company's mission is motivated by AI's increasing use in mission-critical applications, requiring more reliability, safety, and control than "black box" approaches allow.
  • The goal is to enable intentional design, editing, and debugging of AI models, opening up the "black box" for the first time.
  • Interpreting AI models allows for more certainty about their behavior and is compared to understanding drugs at a molecular level rather than just observing outcomes.

Can We Understand Neural Nets? 04:05

  • Eric Ho is optimistic about fully understanding neural networks, citing the advantage of complete access to all parameters, weights, and activations in AI compared to biological brains.
  • Partial understanding already exists through extracting and reconstructing concepts from networks; the field aims to improve this incrementally.
  • Full interpretability is seen as crucial for intentional AI design, drawing an analogy to how thermodynamics improved steam engine safety and design.

Impact on Neuroscience & Human Cognition 07:11

  • Insights from AI interpretability may help accelerate neuroscience by providing analogies and even parallels to human cognition.
  • Diverse thinking patterns (conceptual vs. linguistic) among humans are noted, possibly mirrored in neural network processing.
  • The concept of universality is discussed: similar patterns often emerge in different neural nets and even in comparison to biological systems (e.g., visual cortex similarities).

The Bonsai Analogy & Direct Editing 10:37

  • Goodfire envisions moving from passively "growing" neural networks to actively shaping and pruning them like bonsai trees—combining unsupervised growth with intentional interventions.
  • The approach aspires to map how training data shapes AI cognition and to enable both explanation ("why did you produce this output?") and direct modification to remove harmful or unwanted behaviors.

Approaches to Model Steering 13:50

  • Common techniques like prompt engineering and fine-tuning are considered crude and can introduce unintended side effects, as highlighted by studies showing odd emergent behaviors when models are fine-tuned on problematic data.
  • Fine-tuning can strengthen not only the desired behaviors but also latent, undesirable connections, owing to the alien nature of neural network cognition.

Mechanistic Interpretability: Field Overview & Key Results 17:51

  • Mechanistic interpretability focuses on identifying features (concepts), circuits (combinations of features for higher-order concepts), and universality (similar circuits across models).
  • Notable progress includes the original Circuits thread from OpenAI and contributions from labs such as Anthropic and DeepMind.
  • A major breakthrough was addressing superposition, in which single neurons represent multiple concepts; techniques such as sparse autoencoders and monosemantic decomposition yield "cleaner," interpretable features (a minimal sketch follows below).
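
A minimal sketch of the sparse-autoencoder idea described above, written in PyTorch with made-up dimensions; real interpreter models are far larger and are trained on activations captured from a production model rather than random data.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose dense activations into a wider, mostly-zero feature basis."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activation -> feature space
        self.decoder = nn.Linear(d_features, d_model)  # features -> reconstructed activation

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))      # non-negative, sparse features
        reconstruction = self.decoder(features)
        return features, reconstruction

# One hypothetical training step: reconstruction loss plus an L1 penalty that
# pushes most features toward zero, nudging each one toward a single concept.
sae = SparseAutoencoder(d_model=768, d_features=16384)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 768)                            # stand-in for captured activations
opt.zero_grad()
features, recon = sae(acts)
loss = (recon - acts).pow(2).mean() + 1e-3 * features.abs().mean()
loss.backward()
opt.step()
```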

Superposition, Interpreter Models, and Scaling 20:04

  • Superposition is the phenomenon in which neurons encode multiple concepts, a necessity because large models have far more concepts to represent than dimensions available.
  • Interpreter models are trained on activations from base models to isolate and clarify these overlapping concepts.
  • Auto-interpretability leverages language models to label and explain the concepts found, and these explanations improve as the models themselves improve (see the labeling sketch after this list).
  • These techniques are now scalable to very large models, enabling practical applications.
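
As a rough illustration of the auto-interpretability loop mentioned above, the sketch below ranks the text snippets on which a feature fires most strongly and asks a language model to name the concept. The `label_feature` helper and the stand-in `fake_llm` are hypothetical and do not correspond to any particular vendor's API.

```python
from typing import Callable, Sequence

def label_feature(
    snippets: Sequence[str],
    activations: Sequence[float],
    llm: Callable[[str], str],
    top_k: int = 5,
) -> str:
    """Ask a language model to name a feature from its top-activating text."""
    # Rank snippets by how strongly the feature fired on them.
    ranked = sorted(zip(activations, snippets), reverse=True)[:top_k]
    examples = "\n".join(f"- {text!r} (activation {act:.2f})" for act, text in ranked)
    prompt = (
        "The following text snippets most strongly activate one feature of a "
        "neural network. Propose a short, human-readable label for the concept "
        f"this feature represents:\n{examples}\nLabel:"
    )
    return llm(prompt).strip()

# Usage with a stand-in LLM; in practice this would call a real model API.
fake_llm = lambda prompt: "references to the Golden Gate Bridge"
print(label_feature(
    snippets=["the Golden Gate Bridge at sunset", "driving across the bay", "stock prices fell"],
    activations=[9.1, 7.4, 0.2],
    llm=fake_llm,
))
```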

Real-World Applications & Biological Insights 26:07

  • Goodfire has partnered with Arc Institute to interpret models like Evo 2, a DNA sequence prediction model, uncovering known and potentially novel biological patterns.
  • The hope is that these techniques can lead to new understandings in genomics, such as identifying functions for previously "junk" DNA.
  • Editing successes have mostly involved language and image models; for example, the "Paint with Ember" demo allows targeted image-generation edits by manipulating latent concepts (a generic steering sketch follows this list).
  • Direct, surgical model edits without side effects remain challenging.
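
The list above mentions editing models by manipulating latent concepts. The sketch below shows the generic mechanism of activation steering: adding a concept direction to a layer's activations via a PyTorch forward hook. The toy model, `concept_direction`, and `strength` are placeholders for illustration; this is not Goodfire's Ember API.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained network; the point is the hook, not the model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
target_layer = model[0]

# Hypothetical concept direction, e.g. a decoder row from an interpreter model
# corresponding to a feature we want more (or less) of in the output.
concept_direction = torch.randn(32)
concept_direction /= concept_direction.norm()
strength = 4.0   # positive amplifies the concept, negative suppresses it

def steer(module, inputs, output):
    # Nudge the layer's activations along the concept direction.
    return output + strength * concept_direction

handle = target_layer.register_forward_hook(steer)
steered_logits = model(torch.randn(1, 16))
handle.remove()                                   # restore the unedited model
baseline_logits = model(torch.randn(1, 16))
```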

Interpretability for Governance, Auditing, and Open Models 31:00

  • Interpretability enables model auditing, identifying and removing undesirable behaviors in mission-critical deployments.
  • "Model diffing"—comparing checkpoints to track changes—is an emerging application, helping detect shifts in model properties such as increased sycophancy.
  • Interpretability could become essential for understanding, modifying, and ensuring the trustworthiness of open-source or internationally-developed models.
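
A hedged sketch of the model-diffing idea: run the same interpreter features over a fixed evaluation set for two checkpoints and flag the features whose firing rate shifts most. The `feature_frequencies` and `diff_checkpoints` helpers are hypothetical, and the tensors are synthetic stand-ins for real interpreter-model outputs.

```python
import torch

def feature_frequencies(features: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    """Fraction of inputs on which each feature fires above a threshold.

    `features` has shape (n_examples, n_features), e.g. interpreter-model
    outputs over a fixed evaluation set of prompts.
    """
    return (features > threshold).float().mean(dim=0)

def diff_checkpoints(feats_old: torch.Tensor, feats_new: torch.Tensor, top_k: int = 10):
    """Return the features whose firing rate changed most between checkpoints."""
    delta = feature_frequencies(feats_new) - feature_frequencies(feats_old)
    shift = delta.abs().topk(top_k)
    return list(zip(shift.indices.tolist(), delta[shift.indices].tolist()))

# Stand-in data: a feature whose firing rate jumps between checkpoints could,
# if labeled something like "flattering agreement", surface rising sycophancy.
feats_v1 = (torch.rand(1000, 512) > 0.95).float()
feats_v2 = feats_v1.clone()
feats_v2[:, 42] = (torch.rand(1000) > 0.50).float()   # feature 42 fires far more often
print(diff_checkpoints(feats_v1, feats_v2, top_k=3))
```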

Team, Company Structure, and Ecosystem Role 34:27

  • Goodfire has assembled a team of leading figures from major labs and interpretability research, enabling a robust, independent approach.
  • Independence gives Goodfire a cross-domain, cross-architecture perspective and lets it work with diverse partners beyond any single company’s ecosystem.
  • Collaboration includes Anthropic, which invested in Goodfire and shares the urgency for advancing interpretability before superintelligent AI emerges.

Future Outlook & Predictions 39:48

  • Interpretability is expected to be key in handling AI models from different jurisdictions and ensuring they align with desired properties.
  • Eric Ho predicts rapid AI progress, with interpretability likely required in legal and regulatory contexts as models impact society in unpredictable ways.
  • Looking ahead, Ho believes full interpretability and understanding of neural nets will be achieved by 2028, possibly requiring new conceptual paradigms beyond today’s frameworks.

Rapid-Fire Predictions & Closing Thoughts 42:18

  • After code, important future applications are expected in enterprise automation, along with significant effects on employment.
  • Recommended foundational reading is the original Circuits thread.
  • Noteworthy recent AI experiences involve models like o1 Pro, which appear to reason across business contexts.
  • AI's struggles with humor are noted, prompting hope that interpretability can shed light on these cognitive subtleties.
  • The conversation ends with a confident prediction that the field will reach a robust understanding of neural nets by 2028.