https://blog.roblox.com/2022/01/roblox-return-to-service-10-28-10-31-2021/

by Daniel Sturman in collaboration with Scott Delap, Max Ross, & many others from Roblox and HashiCorp.

Product & Tech

Starting October 28th and fully resolving on October 31st, Roblox experienced a 73-hour outage.¹ Fifty million players regularly use Roblox every day and, to create the experience our players expect, our scale involves hundreds of internal online services. As with any large-scale service, we have service interruptions from time to time, but the extended length of this outage makes it particularly noteworthy. We sincerely apologize to our community for the downtime.

We’re sharing these technical details to give our community an understanding of the root cause of the problem, how we addressed it, and what we are doing to prevent similar issues from happening in the future. We would like to reiterate there was no user data loss or access by unauthorized parties of any information during the incident.

Roblox Engineering and technical staff from HashiCorp combined efforts to return Roblox to service. We want to acknowledge the HashiCorp team, who brought on board incredible resources and worked with us tirelessly until the issues were resolved.

Outage Summary

The outage was unique in both duration and complexity. The team had to address a number of challenges in sequence to understand the root cause and bring the service back up.

Preamble: Our Cluster Environment and HashiStack

Roblox’s core infrastructure runs in Roblox data centers. We deploy and manage our own hardware, as well as our own compute, storage, and networking systems on top of that hardware. The scale of our deployment is significant, with over 18,000 servers and 170,000 containers.