My biggest failure to date

Overview of T3 Chat Outage

  • The video covers T3 Chat's first major outage, during which the site was nearly unusable for a couple of hours. The creator stresses how seriously they take uptime and commits to being transparent about what happened 00:00.
  • The outage was linked to issues with the websocket connection layer, which failed during a migration to a new system using Convex, resulting in chats not loading and severe lag 03:39.
  • The creator describes the migration as a significant rewrite of the app, affecting around 10,000 lines of code, and not just a simple database change 05:28.

Migration Process and Issues

  • Prior to the migration, T3 Chat used a MySQL database on PlanetScale, which became unsustainable for their needs, prompting the switch to Convex for better data synchronization 04:44.
  • The initial migration process involved fetching user data from the MySQL database in chunks and writing it to Convex, but issues arose with incorrect user IDs causing migration loops 12:33.
  • The creator acknowledges that the first migration and the attempts that followed failed, which led them to introduce a beta version for testing the migration under controlled conditions 14:09.
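
The chunked migration described above can be sketched roughly as follows. This is a minimal illustration, not the code shown in the video: the `LegacyUser` shape, the `fetchChunk` and `writeChunk` callbacks, and the in-memory data are all hypothetical stand-ins for the MySQL reads and Convex writes. It assumes positive IDs returned in ascending order, and it shows one plausible guard against a migration loop, which is to stop when the cursor fails to advance.

```typescript
// Hypothetical row shape; the real T3 Chat schema is not described in the source.
interface LegacyUser {
  id: number;
  name: string;
}

// Stand-in for a paginated read: returns up to `limit` rows with id > afterId.
type FetchChunk = (afterId: number, limit: number) => LegacyUser[];
// Stand-in for the write into the new store.
type WriteChunk = (rows: LegacyUser[]) => void;

function migrateInChunks(
  fetchChunk: FetchChunk,
  writeChunk: WriteChunk,
  chunkSize: number,
): number {
  let cursor = 0;
  let migrated = 0;
  for (;;) {
    const rows = fetchChunk(cursor, chunkSize);
    if (rows.length === 0) return migrated;
    const nextCursor = rows[rows.length - 1].id;
    // Guard: if the cursor does not advance, the same chunk would be
    // fetched forever, so fail loudly instead of looping.
    if (nextCursor <= cursor) {
      throw new Error(`cursor stalled at id ${cursor}`);
    }
    writeChunk(rows);
    migrated += rows.length;
    cursor = nextCursor;
  }
}

// Usage with in-memory stand-ins for both databases.
const legacyUsers: LegacyUser[] = [1, 2, 3, 4, 5].map((id) => ({
  id,
  name: `user-${id}`,
}));
const target: LegacyUser[] = [];
const total = migrateInChunks(
  (afterId, limit) => legacyUsers.filter((u) => u.id > afterId).slice(0, limit),
  (rows) => {
    target.push(...rows);
  },
  2,
);
```

Failing fast on a stalled cursor turns a silent infinite loop into a visible error, which is easier to notice during a live migration.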

Technical Challenges Encountered

  • The second migration attempt also failed due to a surge in user activity that overwhelmed the system, revealing that the method of processing migrations concurrently was inadequate 15:21.
  • A workpool component was introduced to manage migration tasks more effectively, limiting the number of concurrent migrations to help stabilize the system 16:32.
  • The creator identified that excessive websocket connections from users keeping T3 Chat open in background tabs contributed to server overloads 31:00.
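
The idea of capping concurrent migrations can be illustrated with a small generic limiter. This sketch does not use the Convex workpool component's API; `runWithLimit` and the fake migration tasks are hypothetical names introduced only to show how a fixed number of workers bounds the work in flight.

```typescript
// Runs `tasks` with at most `limit` in flight at once; results keep input order.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error("limit must be a positive integer");
  }
  const results: T[] = new Array<T>(tasks.length);
  let next = 0;
  // Each worker claims the next unclaimed task until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}

// Usage: six fake per-user migrations, at most two running at a time.
let inFlight = 0;
let maxInFlight = 0;
const migrationTasks = [1, 2, 3, 4, 5, 6].map((userId) => async () => {
  inFlight++;
  maxInFlight = Math.max(maxInFlight, inFlight);
  await Promise.resolve(); // yield, as a real database write would
  inFlight--;
  return userId;
});
const done = runWithLimit(migrationTasks, 2);
```

With a fixed pool of workers, a surge of users triggers queued migrations rather than an unbounded burst of simultaneous writes.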

Convex Collaboration and Resolution

  • The response team from Convex worked closely with T3 Chat during the outage, identifying issues with query throughput and websocket reconnections that led to severe load spikes 22:10.
  • Convex's internal limitations on connection handling and their text search indexing caused additional strain, leading to further issues during peak usage times 34:05.
  • The collaboration resulted in immediate fixes and long-term strategies to improve server handling and performance for T3 Chat, with suggestions for further optimizations 51:25.
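
Load spikes from many clients reconnecting at once are commonly softened with exponential backoff plus random jitter. The summary does not say which fixes were actually applied, so the following is only a sketch of that general technique; `reconnectDelayMs` and its default values are assumptions for illustration.

```typescript
// Delay before reconnect attempt `attempt` (0-based), using "full jitter":
// a random value in [0, min(maxMs, baseMs * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 500,
  maxMs: number = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Usage: a client would wait this long before its next reconnect attempt.
const firstRetry = reconnectDelayMs(0);
const fifthRetry = reconnectDelayMs(4);
```

Because each client draws its own random delay, clients that were disconnected at the same moment spread their reconnects over time instead of returning in one wave.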

Moving Forward

  • The creator outlines plans for better communication during outages, including implementing a status page and a paging system to alert them during incidents 53:46.
  • They express gratitude for user support and stress the commitment to improving the reliability of T3 Chat, acknowledging the importance of transparency and accountability in service outages 01:03:04.
  • Overall, the video is both a reflection on the outage and a detailed account of the lessons the creator plans to apply going forward.