Why the internet went down for 2.5 hours yesterday

Overview of the Internet Outage 00:00

  • A wide range of popular services including Discord, Google Cloud, and Spotify experienced downtime, leading to concerns about reliance on a few major providers.
  • The speaker expresses fear regarding the centralization of internet services and discusses the financial impact of the outage.

Sponsorship and PostHog 00:49

  • The speaker introduces PostHog, an analytics and feature management tool, and shares how it has been beneficial for their project T3 Chat.

Core Internet Infrastructure 02:35

  • Discussion on the major cloud service providers: AWS, Google Cloud, Azure, and Cloudflare, noting that Azure was not affected.
  • AWS experienced some issues but no definitive reports confirm a major outage.

Causes of the Outage 04:54

  • Both Cloudflare and Google Cloud went down. On the Cloudflare side, the failure was in Workers KV, a service that many of its own products depend on.
  • The speaker clarifies that the outage was not due to a security event and there was no data loss.

Technical Breakdown of Workers KV 07:40

  • Workers KV is essential for many Cloudflare services, providing the key-value storage that many applications depend on.
  • The outage was traced back to a failure in the underlying storage infrastructure that Workers KV relies on.

Impact and Affected Services 12:08

  • The outage lasted 2 hours and 28 minutes, affecting all Cloudflare customers using the impacted services.
  • Key services affected included access management, image uploads, and real-time streaming, with a high failure rate for Workers KV requests.
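
To make the Workers KV dependency concrete, here is a minimal sketch, not Cloudflare's actual code, of how a service that reads its configuration from a key-value store fails as soon as the store's backing storage goes away. The `KVStore` class, the `check_access` function, and the `policy:` key format are all hypothetical names invented for illustration.

```python
class StorageUnavailable(Exception):
    """Raised when the (hypothetical) backing storage cannot be reached."""


class KVStore:
    """Toy key-value store standing in for a service like Workers KV."""

    def __init__(self):
        self._data = {}
        self.backend_up = True  # set to False to simulate the storage outage

    def put(self, key, value):
        if not self.backend_up:
            raise StorageUnavailable("backing storage is down")
        self._data[key] = value

    def get(self, key):
        if not self.backend_up:
            raise StorageUnavailable("backing storage is down")
        return self._data.get(key)


def check_access(kv, user):
    """Hypothetical access check that reads its policy from the KV store."""
    policy = kv.get(f"policy:{user}")
    return policy == "allow"


kv = KVStore()
kv.put("policy:alice", "allow")
healthy_result = check_access(kv, "alice")  # store is reachable

kv.backend_up = False  # simulate the underlying storage failing
try:
    check_access(kv, "alice")
    outage_result = "ok"
except StorageUnavailable:
    outage_result = "failed"
```

In this toy model the access check has no code bug of its own. It fails only because the single store it depends on is unreachable, which is the cascading pattern the summary describes.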

Response and Recovery Actions 26:02

  • Cloudflare's team worked quickly to identify the cause and, while the outage was still unfolding, began work on removing the dependency on Workers KV.
  • Services gradually began to recover as the underlying storage infrastructure was restored.

Future Improvements 28:02

  • Cloudflare plans to improve the resiliency of its services, focusing on reducing dependencies on single providers and enhancing storage infrastructure.
  • Strategies will include better redundancy for Workers KV and tools to manage service availability during outages.
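
As a rough picture of what redundancy across storage providers can look like, the sketch below tries a list of backends in order and returns the first successful read. It is an assumption-laden illustration, not Cloudflare's planned design: the `Backend` class, the `redundant_get` function, and the backend names are all made up for this example.

```python
class BackendDown(Exception):
    """Raised when a (hypothetical) storage backend is unreachable."""


class Backend:
    """Toy storage backend that can be switched off to simulate an outage."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.up = True

    def get(self, key):
        if not self.up:
            raise BackendDown(self.name)
        return self.data.get(key)


def redundant_get(backends, key):
    """Try each backend in order; return (value, name of backend that answered)."""
    failed = []
    for backend in backends:
        try:
            return backend.get(key), backend.name
        except BackendDown:
            failed.append(backend.name)
    # Only reached when every backend raised BackendDown.
    raise BackendDown("all backends down: " + ", ".join(failed))


records = {"config:site": "v42"}
primary = Backend("primary", records)
secondary = Backend("secondary", records)

primary.up = False  # simulate one storage provider failing
value, served_by = redundant_get([primary, secondary], "config:site")
```

With the primary switched off, the read is served by the secondary. If both were down, `redundant_get` would raise `BackendDown`, so redundancy narrows the failure window but does not eliminate it.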

Conclusion 31:01

  • The speaker commends Cloudflare for their transparency and ownership of the incident, highlighting the lessons learned from the outage.
  • The speaker stresses that providers must own the reliability of the services they offer to users, and hopes for a more resilient internet infrastructure in the future.