On Tuesday, Feb. 28, 2017, Desmos experienced a complete service outage of our
public calculator as well as the API that our partners use to embed the
calculator into their sites. Our outage was related to the outage of
Amazon’s S3 storage service, which also affected many other sites across
the internet. We’d like to explain what happened and some steps we are
taking to reduce the chance of something similar happening again.
Amazon S3 is a critical piece of internet infrastructure that many sites like Desmos use for data storage. In particular, Desmos uses S3 to store our users’ saved graphs.
At 9:37AM PST on Tuesday, the S3 service stopped responding to network requests (more details from the Amazon team). As Desmos users attempted to access their saved graphs, open requests to retrieve the graph data from S3 accumulated and eventually overwhelmed our web servers. These servers stopped responding to network requests, including the “health check” requests that we use to automatically remove failing hardware from our system. Because the servers failed their health checks, they were automatically terminated.
At 11:11AM PST, Desmos restored partial service, including full service for our partner API, by routing traffic to servers in “maintenance mode.” When the site is in maintenance mode, no external requests are made to either S3 or our database. Users can still use the calculator, but they cannot sign in to their accounts or access saved graphs.
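As a rough illustration of the maintenance-mode behavior described above, here is a minimal Python sketch. It is not our production code: the route names, the `Response` type, and the `fetch_from_backend` callback are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical route names; the real endpoints are not listed in this post.
ROUTES_NEEDING_BACKEND = {"/signin", "/saved-graphs"}


@dataclass
class Response:
    status: int
    body: str


def handle_request(
    path: str,
    maintenance_mode: bool,
    fetch_from_backend: Callable[[str], str],
) -> Response:
    """Serve a request without touching S3 or the database in maintenance mode."""
    if path in ROUTES_NEEDING_BACKEND:
        if maintenance_mode:
            # Fail fast instead of opening a request to an unavailable backend.
            return Response(503, "Accounts and saved graphs are temporarily unavailable.")
        return Response(200, fetch_from_backend(path))
    # The calculator itself is served without any external calls.
    return Response(200, "calculator")
```

The point of the sketch is that, in maintenance mode, requests that would normally need S3 or the database return immediately rather than waiting on a backend that is not responding.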
At 3:42PM PST, full service was restored for the rest of the site. At this time, some of our internal infrastructure tools were performing very slowly, so we waited until these tools were operating normally at 5:55PM PST to announce that full service had been restored.
The most frustrating part of the outage from our point of view is that it took over 90 minutes to restore API service and “maintenance mode” service for the calculator. We had an emergency system in place that serves the API and the calculator using only static files so that the site can operate with basic functionality even if all of our web servers fail. Under normal circumstances we would have been able to switch over to this system in a matter of minutes; however, these static files were stored on S3, so the same S3 outage that interrupted our primary service meant that we could not use this emergency system either.
Going forward, we are planning to host a redundant copy of this emergency system with a different storage provider, so that we can respond faster in case of Amazon infrastructure failures.
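The idea of a redundant copy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fetch_static_file` and the provider callables are hypothetical, and each callable stands in for a client of one storage provider.

```python
from typing import Callable, Sequence


def fetch_static_file(name: str, providers: Sequence[Callable[[str], bytes]]) -> bytes:
    """Return the file from the first storage provider that responds."""
    errors = []
    for fetch in providers:
        try:
            return fetch(name)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed for {name!r}: {errors}")
```

With the primary provider listed first and an independent provider second, a failure of the first no longer prevents the emergency files from being served.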
Our emergency response was also complicated by the fact that all of our normal web servers (which are also hosted with Amazon) were terminated when they failed their health checks. Because of cascading failures of Amazon’s systems, we were not able to bring up new replacement servers for several hours. To restore partial service at 11:11AM, we had to repurpose a server normally used for internal staging, manually switch it into maintenance mode, and route all of our traffic to it.
Going forward, we will no longer automatically terminate servers that unexpectedly fail health checks. Instead, we will simply route traffic away from them, but leave them available for a period of time for inspection by our engineers. This will allow us to route traffic back to existing servers more rapidly in case of an external emergency that causes all servers to fail health checks at the same time.
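A minimal Python sketch of this policy follows. The `Server` type, the function names, and the one-hour inspection window are assumptions made for the example; this post does not specify how long failed servers will be kept.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed value for illustration only.
INSPECTION_WINDOW_SECONDS = 3600.0


@dataclass
class Server:
    name: str
    in_rotation: bool = True
    failed_since: Optional[float] = None  # time of the first failed check, if any


def record_health_check(server: Server, healthy: bool, now: float) -> None:
    """Route traffic away from a failing server, but never terminate it here."""
    if healthy:
        # A recovered server can take traffic again right away.
        server.in_rotation = True
        server.failed_since = None
    else:
        server.in_rotation = False
        if server.failed_since is None:
            server.failed_since = now


def past_inspection_window(server: Server, now: float,
                           window: float = INSPECTION_WINDOW_SECONDS) -> bool:
    """True once a failing server has been held for the full inspection window."""
    return server.failed_since is not None and now - server.failed_since >= window
```

Because a failed check only changes `in_rotation`, a server that starts passing checks again after an external outage ends can be put back into service without waiting for a replacement to be provisioned.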
Finally, we would like to improve how quickly we communicate with our API partners in case of an emergency. We sent our first email to partners after approximately an hour of downtime.
Going forward, it will be our policy to email partners immediately in case of any API outage. We will send further explanation and advice as they become available, but gathering those details should not delay the initial notice that there is a problem.
At Desmos, education is our passion, and we are painfully aware of how disruptive it is to classrooms when technology does not work according to plan. One of our core design principles is “works every time,” which means that teachers should always be able to rely on our tools to work the same way during class that they did during lesson preparation. We, like much of the web, didn’t anticipate an S3 outage of this magnitude. But working every time means ensuring that we’re resilient to even the most unlikely events. We are striving to use the lessons that we learned from this incident to make our service more reliable in the future.