AI models are reaching a tipping point in quality and cost, driving an unstoppable surge in inference demand.
Over the past 18 months, AI projects have moved rapidly from concept to affecting companies' top-line revenue, and companies are now willing to spend heavily on AI.
The talk predicts that within a decade, AI spend could represent a double-digit percentage of global GDP.
Despite model advances, the talk claims that every current AI provider is losing money on inference, including OpenAI Pro and many startups.
As enterprises move workloads to transformer models and on-premises deployments, the financial and infrastructure challenges escalate, driven especially by peak-load demands and hardware improvements that have fallen short of expectations.
Limits of Hardware Scaling and Rise of Custom Chips 02:07
Inference demand is growing exponentially, but compute density improvements have slowed since the Nvidia H100.
Recent hardware advances come mainly from adding more silicon area and increasing power consumption, rather than improving efficiency or raw performance.
The result is a widening gap between demand for inference and supply of economically viable compute.
There is an emerging trend towards creating fully custom hardware, including chips and entire data centers tailored for specific AI models.
In the next few years, each major model may get its own data center with custom chips, interconnects, racks, thermals, and power solutions.
At large scale (e.g., $100 billion spent on inference), it becomes cost-effective to build separate, specialized hardware for training and inference.
Specializing for specific data types (such as 4-bit floating point, FP4, for inference and 8-bit floating point, FP8, for training) leads to significant gains but requires rethinking chip and system design.
Specialization changes workloads and affects everything from interconnects to power delivery and PCB design.
This drives a new market for end-to-end custom infrastructure for each generation of mega-models.
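To make the data-type point above concrete, here is a minimal sketch of how the memory needed for model weights shrinks with narrower formats. The 70-billion-parameter model size is a hypothetical figure chosen for illustration, not one from the talk, and the calculation ignores scaling factors, activations, and KV cache.

```python
# Bits per weight for each format. FP8 and FP4 are the types named in the text;
# FP16 is included only as a familiar point of comparison.
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "fp4": 4}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the memory needed to hold the weights alone, in GB (1e9 bytes)."""
    bytes_total = num_params * BITS_PER_WEIGHT[dtype] / 8
    return bytes_total / 1e9


# Hypothetical model size, chosen only for illustration.
NUM_PARAMS = 70e9

footprints = {dtype: weight_memory_gb(NUM_PARAMS, dtype) for dtype in BITS_PER_WEIGHT}
for dtype, gb in footprints.items():
    print(f"{dtype}: {gb:.0f} GB of weights")
```

Halving the bits per weight halves both the memory footprint and the bytes that must be moved per token, which is one reason a narrower format can change the design of memory, interconnect, and power delivery rather than just the arithmetic units.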
Etched’s Approach: The SOHU Transformer ASIC 05:46
Etched’s first product, SOHU, specializes in transformer inference, intentionally forgoing support for high-precision data types and training capabilities.
The chip supports only the features that remain stable across the relevant models, which lets it make trade-offs that general-purpose hardware cannot.
The approach is meant to carry over to whichever architecture dominates in the future, because it relies on specializing for an architecture rather than on optimizations tied to one particular model.
The first demonstration of SOHU is expected later in the year, and Etched anticipates that it will have a significant impact.
Real-World Use Cases and Economic Bottlenecks 07:01
Real-time video generation was demonstrated with Oasis, an AI-generated game that quickly attracted over a million users but was limited by hardware to only 5,000 concurrent users.
No current real-time streaming generative video API exists due to poor economics and extreme peak load challenges.
Real-time generative ads and gaming, where latency below 100 ms would unlock new industries, are not economically feasible on existing hardware.
For real-time code generation (e.g. autocomplete), small models and low batch sizes are used to meet latency needs, but this leads to high cost and lower model quality.
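The cost effect of low batch sizes can be sketched with simple arithmetic. The example below assumes that each decode step emits one token per sequence in the batch and that step time stays roughly constant across the small batch sizes shown; that is a simplification, and the hourly cost and step time are hypothetical values, not figures from the talk.

```python
def cost_per_million_tokens(hourly_cost_usd: float, step_time_s: float, batch_size: int) -> float:
    """Cost of generating one million tokens at a given batch size.

    Assumes each decode step produces one token for every sequence in the batch.
    """
    tokens_per_second = batch_size / step_time_s
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1e6


HOURLY_COST_USD = 2.0  # hypothetical accelerator rental price
STEP_TIME_S = 0.02     # hypothetical 20 ms decode step, assumed constant per batch

costs = {b: cost_per_million_tokens(HOURLY_COST_USD, STEP_TIME_S, b) for b in (1, 8, 32)}
for batch_size, usd in costs.items():
    print(f"batch {batch_size:>2}: ${usd:.2f} per million tokens")
```

Under these assumptions the cost per token falls in direct proportion to batch size, so a latency target that forces a batch of 1 pays the full price of the accelerator for a single user's tokens.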
Many enterprise workloads, such as telcos serving 100 million hourly users, see peak loads of 5–10× their average, so hardware sized for the peak sits largely idle the rest of the time and the investment becomes uneconomical.
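A rough sketch of the peak-load arithmetic above: if a fleet is sized for a peak that is a multiple of the average load, average utilization is about one over that multiple. The average load and per-accelerator throughput below are hypothetical values used only to make the calculation runnable.

```python
import math


def accelerators_needed(load_rps: float, per_accelerator_rps: float) -> int:
    """Smallest whole number of accelerators that can serve the given load."""
    return math.ceil(load_rps / per_accelerator_rps)


AVERAGE_LOAD_RPS = 10_000  # hypothetical average requests per second
PER_ACCELERATOR_RPS = 50   # hypothetical throughput of one accelerator

baseline = accelerators_needed(AVERAGE_LOAD_RPS, PER_ACCELERATOR_RPS)

# Map each peak-to-average multiple to (fleet size, average utilization).
results = {}
for multiple in (5, 10):
    fleet = accelerators_needed(AVERAGE_LOAD_RPS * multiple, PER_ACCELERATOR_RPS)
    utilization = AVERAGE_LOAD_RPS / (fleet * PER_ACCELERATOR_RPS)
    results[multiple] = (fleet, utilization)
    print(f"{multiple}x peak: {fleet} accelerators vs {baseline} for the average, "
          f"{utilization:.0%} average utilization")
```

In this simplified model the hardware bill scales with the peak while the revenue-generating work scales with the average, which is the gap the talk describes.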
Etched has grown to 150 people, attracting talent from leading AI chip and software companies.
The company has sold out its first production run, suggesting rapid market uptake.
High-profile hires include veterans who led platform and systems development at Nvidia, and software leadership from Google's TPU and DeepMind projects.
Former leaders from Google’s Astra project joined Etched, citing the economic and deployment advantage of its specialized approach.
Etched’s focus on transformer-specific hardware may be the only near-term solution capable of handling the increasing demands of AI inference at scale.