On Tuesday, Feb. 28, 2017, Desmos experienced a complete service outage of our
public calculator as well as the API that our partners use to embed the
calculator into their sites. Our outage was triggered by the outage of
Amazon’s S3 storage service, which also affected many other sites across
the internet. We’d like to explain what happened and the steps we are
taking to reduce the chance of something similar happening again.
Amazon S3 is a critical piece of internet infrastructure that many sites like
Desmos use for data storage. In particular, Desmos uses S3 to store our
users’ saved graphs.
At 9:37AM PST on Tuesday, the S3 service stopped responding to network
requests (the Amazon team has published more details). As Desmos users
attempted to access their saved graphs, open requests to retrieve the graph
data from S3 accumulated and eventually overwhelmed our web servers. These
servers stopped responding to network requests, including the
“health check” requests that we use to automatically remove
failing hardware from our system. Because these servers were failing their
health checks, they were automatically terminated.
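The failure mode described above can be sketched in a few lines of Python.
This is a toy model, not production code: the fixed number of request slots
and every name in it (WebServer, request_saved_graph, health_check) are
assumptions made for illustration. It shows how requests that never complete
can crowd out the health check on an otherwise working server.

```python
class WebServer:
    """Toy model: a server with a fixed number of request slots (an assumption)."""

    def __init__(self, slots):
        self.slots = slots
        self.waiting_on_storage = 0  # requests stuck waiting for a storage reply

    def has_free_slot(self):
        return self.waiting_on_storage < self.slots

    def request_saved_graph(self, storage_responding):
        if not self.has_free_slot():
            return "rejected"
        if storage_responding:
            return "served"  # completes at once, so the slot is released
        self.waiting_on_storage += 1  # never completes; the slot stays occupied
        return "waiting"

    def health_check(self):
        # The health check is just another request: it needs a free slot too.
        return self.has_free_slot()


# Storage is down: three requests hang, the fourth finds no free slot.
server = WebServer(slots=3)
results = [server.request_saved_graph(storage_responding=False) for _ in range(4)]
```

Once every slot is held by a request waiting on storage, health_check()
returns False even though nothing is wrong with the server itself.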
At 11:11AM PST, Desmos restored partial service, including full service for
our partner API, by routing traffic to servers in “maintenance mode.”
When the site is in maintenance mode, no external requests are made to either
S3 or our database, so users are able to use the calculator but cannot
sign in to their accounts or access saved graphs.
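As a minimal sketch of what a maintenance-mode switch can look like, the
Python below gates every route that needs S3 or the database behind a single
flag. It is illustrative only: the route paths, the Site class, and the
response strings are hypothetical, and the external_calls list merely stands
in for real requests to a backend.

```python
# Hypothetical routes that would need S3 or the database.
BACKEND_ROUTES = {"/account/login", "/saved-graphs"}


class Site:
    def __init__(self, maintenance_mode=False):
        self.maintenance_mode = maintenance_mode
        self.external_calls = []  # stand-in for requests sent to S3/database

    def handle(self, path):
        """Return a (status, body) pair for the requested path."""
        if path in BACKEND_ROUTES:
            if self.maintenance_mode:
                # Refuse before any external request is attempted.
                return 503, "sign-in and saved graphs are temporarily unavailable"
            self.external_calls.append(path)
            return 200, "served with backend data"
        # The calculator itself needs no backend.
        return 200, "served from local calculator assets"


site = Site(maintenance_mode=True)
calculator_response = site.handle("/calculator")
saved_graphs_response = site.handle("/saved-graphs")
```

With the flag set, the calculator route still returns 200 while the
saved-graphs route returns 503 without recording any external call.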
At 3:42PM PST, full service was restored for the rest of the site. At this
time, some of our internal infrastructure tools were performing very slowly,
so we waited until these tools were operating normally at 5:55PM PST to
announce that full service had been restored.
The most frustrating part of the outage from our point of view is that it took
over 90 minutes to restore API service and “maintenance mode”
service for the calculator. We had an emergency system in place that serves
the API and the calculator using only static files so that the site can
operate with basic functionality even if all of our web servers fail. Under
normal circumstances we would have been able to switch over to this system in
a matter of minutes; however, these static files were stored on S3, so the
same S3 outage that interrupted our primary service meant that we could not
use this emergency system either.
Going forward, we are planning to host a redundant copy of this emergency
system with a different storage provider, so that we can respond faster in
case of Amazon infrastructure failures.
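One way to picture a redundant copy is a lookup that tries each storage
origin in order and uses the first one that answers. The Python sketch below
assumes this ordering approach; fetch_with_fallback and the two stand-in
fetchers are hypothetical, and no real storage provider API is called.

```python
from typing import Callable, Sequence, Tuple

Fetcher = Callable[[str], bytes]


def fetch_with_fallback(
    path: str, origins: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, bytes]:
    """Try each (name, fetcher) pair in order; return the first success."""
    errors = []
    for name, fetch in origins:
        try:
            return name, fetch(path)
        except Exception as exc:  # any failure means: try the next origin
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all origins failed: " + "; ".join(errors))


# Stand-in fetchers that simulate an outage at the primary provider.
def primary_down(path: str) -> bytes:
    raise ConnectionError("primary storage unreachable")


def secondary_ok(path: str) -> bytes:
    return b"<html>calculator</html>"


origin_used, body = fetch_with_fallback(
    "calculator.html",
    [("primary", primary_down), ("secondary", secondary_ok)],
)
```

The point of the design is that the second origin shares no infrastructure
with the first, so a single provider outage cannot take out both copies.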
Our emergency response was also complicated by the fact that all of our normal
web servers (which are also hosted with Amazon) were terminated when they
failed their health checks. Because of cascading failures of Amazon’s
systems, we were not able to bring up new replacement servers for several
hours. To restore partial service as early as we did, we had to repurpose a
server that is usually used as an internal staging server, manually switch it
into maintenance mode, and route all of our traffic to it.
Going forward, we will no longer automatically terminate servers that
unexpectedly fail health checks. Instead, we will simply route traffic away
from them, but leave them available for a period of time for inspection by our
engineers. This will allow us to route traffic back to existing servers more
rapidly in case of an external emergency that causes all servers to fail
health checks at the same time.
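The policy above, removing a failing server from rotation while keeping it
alive, can be sketched as a small pool object. This Python example is a
sketch under stated assumptions: ServerPool, the hold_seconds window, and the
manual restore step are all hypothetical names and choices, not a description
of any real load balancer's API.

```python
import time


class ServerPool:
    def __init__(self, servers, hold_seconds=3600, clock=time.monotonic):
        self.in_rotation = set(servers)
        self.quarantined = {}  # server name -> time it left rotation
        self.hold_seconds = hold_seconds
        self.clock = clock

    def report_failure(self, server):
        """Stop routing traffic to the server, but do not terminate it."""
        if server in self.in_rotation:
            self.in_rotation.remove(server)
            self.quarantined[server] = self.clock()

    def restore(self, server):
        """Route traffic back to a quarantined server after inspection."""
        if server in self.quarantined:
            del self.quarantined[server]
            self.in_rotation.add(server)

    def expired(self):
        """Servers held longer than the inspection window."""
        now = self.clock()
        return sorted(
            name
            for name, since in self.quarantined.items()
            if now - since >= self.hold_seconds
        )


# A fake clock keeps the example deterministic.
now = [0.0]
pool = ServerPool(["web-1", "web-2"], hold_seconds=60, clock=lambda: now[0])
pool.report_failure("web-1")
pool.report_failure("web-2")
pool.restore("web-1")
now[0] = 61.0
```

If an external failure makes every server fail at once, the machines are
still there, and restore() puts them back in rotation without waiting for
replacements to boot.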
Finally, we would like to improve how quickly we communicate with our API
partners in case of an emergency. We sent our first email to partners after
approximately an hour of downtime.
Going forward, it will be our policy to email partners immediately in case of
any API outage. Further explanation and advice will be communicated as they
become available, but will not delay the initial notice that there is a
problem.
At Desmos, education is our passion, and we are painfully aware of how
disruptive it is to classrooms when technology does not work according to
plan. One of our core design principles is “works every time,”
which means that teachers should always be able to rely on our tools to work
the same way during class that they did during lesson preparation. We, like
much of the web, didn’t anticipate an S3 outage of this magnitude. But
working every time means ensuring that we’re resilient to even the most
unlikely events. We are striving to use the lessons that we learned from this
incident to make our service more reliable in the future.