The video discusses T3 Chat's first major outage, during which the site was nearly unusable for a couple of hours. The creator stresses how seriously they take uptime and their commitment to transparency 00:00.
The outage was traced to the websocket connection layer, which failed during a migration to a new system built on Convex; chats would not load and the app lagged severely 03:39.
The creator describes the migration as a significant rewrite of the app, touching around 10,000 lines of code, rather than a simple database change 05:28.
Migration Process and Issues
Prior to the migration, T3 Chat used a MySQL database on PlanetScale, which became unsustainable for their needs, prompting the switch to Convex for better data synchronization 04:44.
The initial migration process involved fetching user data from the MySQL database in chunks and writing it to Convex, but issues arose with incorrect user IDs causing migration loops 12:33.
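The chunked copy described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not T3 Chat's actual migration code: the `UserRow` shape, numeric IDs, the in-memory array standing in for MySQL, and the `Map` standing in for Convex are all hypothetical. It shows one way a chunked copy can guard against looping: the cursor must strictly advance after every chunk, and writes are keyed by ID so a repeated run does not duplicate data.

```typescript
// All names here are hypothetical stand-ins, not T3 Chat's actual code.
interface UserRow {
  id: number;
  name: string;
}

// Read one chunk of rows whose id is greater than the cursor (keyset pagination).
function fetchChunk(source: UserRow[], afterId: number, size: number): UserRow[] {
  return source
    .filter((row) => row.id > afterId)
    .sort((a, b) => a.id - b.id)
    .slice(0, size);
}

// Copy every row from source to target in chunks; returns the number of chunks written.
function migrateUsers(
  source: UserRow[],
  target: Map<number, UserRow>,
  chunkSize: number,
): number {
  let cursor = -Infinity;
  let chunks = 0;
  for (;;) {
    const chunk = fetchChunk(source, cursor, chunkSize);
    if (chunk.length === 0) return chunks;
    // Upsert keyed by id, so re-running the migration is idempotent.
    for (const row of chunk) target.set(row.id, row);
    const next = chunk[chunk.length - 1].id;
    // Guard: if the cursor fails to advance, the same chunk would be fetched forever.
    if (!(next > cursor)) throw new Error(`cursor did not advance past ${cursor}`);
    cursor = next;
    chunks++;
  }
}
```

Keying both the cursor and the write on the same ID is the design choice that matters here: if the ID used to advance differs from the ID used to fetch, the loop can re-read the same rows indefinitely.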
The creator acknowledges that the first migration and subsequent attempts failed, which led them to introduce a beta version for testing the migration under controlled conditions 14:09.
Technical Challenges Encountered
The second migration attempt also failed due to a surge in user activity that overwhelmed the system, revealing that the method of processing migrations concurrently was inadequate 15:21.
A workpool component was introduced to manage migration tasks more effectively, limiting the number of concurrent migrations to help stabilize the system 16:32.
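The idea of capping concurrent migrations can be illustrated with a small in-process limiter. This is a sketch only and does not use the Convex workpool component's API; the `MigrationPool` class, its `maxParallelism` option, and the `active` and `queued` counters are hypothetical names chosen for the example. Tasks beyond the limit wait in a queue and start only when a running task finishes.

```typescript
// Hypothetical in-process limiter; this is not the Convex workpool API.
type Task = () => Promise<void>;

class MigrationPool {
  private readonly maxParallelism: number;
  private readonly queue: Task[] = [];
  active = 0; // number of tasks currently running

  constructor(maxParallelism: number) {
    this.maxParallelism = maxParallelism;
  }

  // Number of tasks waiting for a free slot.
  get queued(): number {
    return this.queue.length;
  }

  enqueue(task: Task): void {
    this.queue.push(task);
    this.drain();
  }

  // Start queued tasks until the concurrency limit is reached.
  private drain(): void {
    while (this.active < this.maxParallelism && this.queue.length > 0) {
      const task = this.queue.shift()!;
      this.active++;
      const done = (): void => {
        this.active--;
        this.drain();
      };
      // Free the slot whether the task succeeds or fails;
      // a real system would also record or retry failures.
      task().then(done, done);
    }
  }
}
```

With a limit of 2 and five enqueued migrations, two start immediately and three wait, so a surge of users produces a longer queue rather than a spike in simultaneous work.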
The creator identified that excessive websocket connections from users keeping T3 Chat open in background tabs contributed to server overloads 31:00.
Convex Collaboration and Resolution
The response team from Convex worked closely with T3 Chat during the outage, identifying issues with query throughput and websocket reconnections that led to severe load spikes 22:10.
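Reconnection storms of the kind described above are commonly softened with exponential backoff plus random jitter, so that clients which drop at the same moment do not all retry at once. The sketch below is a generic illustration of that technique, not Convex's client logic or the fix that was actually shipped; `reconnectDelayMs`, its parameters, and the default values are assumptions made for the example.

```typescript
// Hypothetical helper: how long a client waits before reconnect attempt N.
// Uses "full jitter": a random delay between 0 and an exponentially growing ceiling.
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 1_000,
  capMs: number = 30_000,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  // The ceiling doubles with each attempt but never exceeds capMs.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Spreading retries across a window trades a slightly slower reconnect for each client against a much flatter load curve on the server.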
Convex's internal limitations on connection handling and their text search indexing caused additional strain, leading to further issues during peak usage times 34:05.
The collaboration resulted in immediate fixes and long-term strategies to improve server handling and performance for T3 Chat, with suggestions for further optimizations 51:25.
Moving Forward
The creator outlines plans for better communication during outages, including implementing a status page and a paging system to alert them during incidents 53:46.
They express gratitude for user support and stress the commitment to improving the reliability of T3 Chat, acknowledging the importance of transparency and accountability in service outages 01:03:04.
Overall, the video serves as both a reflection on the challenges faced during the outage and a detailed account of lessons learned for future improvements.