Flipping the Inference Stack — Robert Wachen, Etched

The Inference Singularity and Demand Surge 00:03

  • AI models are reaching a tipping point in quality and cost, driving an unstoppable surge in inference demand.
  • Over the past 18 months, projects have rapidly advanced from conception to impacting companies’ top-line revenue, and companies are now willing to invest heavily in AI.
  • Wachen predicts that within a decade, AI spend could reach a double-digit percentage of global GDP.
  • Despite model advances, every current AI provider is losing money on inference, including OpenAI’s Pro tier and many startups.
  • As enterprises move workloads to transformer models and on-premises deployments, financial and infrastructure challenges escalate, especially peak-load handling and hardware whose generational improvements fail to keep pace.

Limits of Hardware Scaling and Rise of Custom Chips 02:07

  • Inference demand is growing exponentially, but compute density improvements have slowed since the Nvidia H100.
  • Recent hardware advances come mainly from adding silicon area and increasing power consumption rather than from genuine gains in efficiency or per-area performance.
  • The result is a widening gap between demand for inference and supply of economically viable compute.
  • There is an emerging trend towards creating fully custom hardware, including chips and entire data centers tailored for specific AI models.
  • In the next few years, each major model may get its own data center with custom chips, interconnects, racks, thermals, and power solutions.

Specialization in Hardware for AI Inference 03:49

  • At large scale (e.g., $100 billion spent on inference), it becomes cost-effective to build separate, specialized hardware for training and inference.
  • Specializing for specific data types (e.g., FP4 for inference, FP8 for training) yields significant gains but requires rethinking chip and system design (see the sketch after this list).
  • Specialization reshapes the workload, and with it everything from interconnects to power delivery and PCB design.
  • This drives a new market for end-to-end custom infrastructure for each generation of mega-models.
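
To make the data-type point concrete, here is a minimal sketch (not Etched’s implementation) of symmetric 4-bit weight quantization in Python/NumPy. Real FP4/FP8 formats split bits between exponent and mantissa rather than using a uniform integer grid, but the storage arithmetic is the same: 4-bit values move a quarter of the data of 16-bit ones per token, which is where much of the inference-side gain comes from.

```python
import numpy as np

def quantize_sym(x: np.ndarray, bits: int):
    """Map x onto a symmetric signed integer grid with 2**bits levels."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit
    scale = np.abs(x).max() / qmax      # one scale per tensor (per-tensor quantization)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Illustrative weight matrix; sizes and values are assumptions, not talk figures.
w = np.random.randn(4096, 4096).astype(np.float32)
q4, s = quantize_sym(w, bits=4)

err = np.abs(w - dequantize(q4, s)).mean()
print(f"mean abs quantization error: {err:.4f}")
print(f"fp16 weight bytes:  {w.size * 2:>12,}")
print(f"4-bit weight bytes: {w.size // 2:>12,}  (4x less data moved per token)")
```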

Etched’s Approach: The Sohu Transformer ASIC 05:46

  • Etched’s first product, Sohu, specializes in transformer inference, intentionally forgoing support for high-precision data types and for training.
  • The chip implements only features that remain stable across the relevant models, accepting trade-offs that general-purpose hardware cannot make.
  • The approach is designed to adapt to whatever architecture dominates next by specializing at the architecture level rather than optimizing for individual models.
  • The first public demonstration of Sohu is expected later this year, with significant impact anticipated.

Real-World Use Cases and Economic Bottlenecks 07:01

  • Real-time video generation was demonstrated with Oasis, an AI-generated game that quickly attracted over a million users but was limited by hardware to only 5,000 concurrent players.
  • No real-time streaming generative-video API exists today because of poor economics and extreme peak-load challenges.
  • Real-time generative ads and gaming, where sub-100 ms latency would unlock new industries, are not yet economically feasible on existing hardware.
  • Real-time code generation (e.g., autocomplete) relies on small models and low batch sizes to meet latency targets, which drives up cost per token and lowers model quality.
  • Many enterprise workloads, such as telcos serving 100 million hourly users, see peak loads of 5–10× their average, making current hardware investments uneconomical (see the worked example below).
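
A back-of-envelope sketch of why peak-to-average ratios wreck the economics: capacity must be provisioned for peak, so average utilization is pinned at one over the peak multiple. The talk only cites the 5–10× multiple; every other number below (throughput per accelerator, load, hourly cost) is an illustrative assumption.

```python
# All values are assumptions for illustration, except the 5-10x peak multiple from the talk.
avg_load_tps = 1_000_000      # average tokens/sec across the day (assumed)
peak_multiple = 8             # within the talk's cited 5-10x range
gpu_tps = 10_000              # tokens/sec per accelerator (assumed)
gpu_cost_per_hr = 3.00        # $/hr per accelerator (assumed)

# Fleet must be sized for peak traffic, not average traffic.
gpus_needed = (avg_load_tps * peak_multiple) / gpu_tps
utilization = avg_load_tps / (gpus_needed * gpu_tps)   # = 1 / peak_multiple

tokens_per_hr = avg_load_tps * 3600
cost_per_hr = gpus_needed * gpu_cost_per_hr
print(f"accelerators provisioned for peak: {gpus_needed:,.0f}")
print(f"average utilization: {utilization:.1%}")       # 12.5% at an 8x peak
print(f"effective $/M tokens: {cost_per_hr / (tokens_per_hr / 1e6):.4f}")
```

At an 8× peak multiple, seven-eighths of the fleet sits idle on average, so the effective cost per token is roughly 8× what the raw hardware throughput would suggest.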

Team, Scaling Up, and Market Adoption 11:10

  • Etched has grown to 150 people, attracting talent from leading AI chip and software companies.
  • The company has sold out its first production run, suggesting rapid market uptake.
  • High-profile hires include veterans who led platform and systems development at Nvidia, and software leadership from Google's TPU and DeepMind projects.
  • Former leaders from Google’s Astra project joined Etched, citing the economic and deployment advantage of its specialized approach.
  • Etched’s focus on transformer-specific hardware may be the only near-term solution capable of handling the increasing demands of AI inference at scale.