By John P. Desmond, IAIDL Editor
On the morning of Tuesday, June 8, many websites went down after an outage at the cloud service firm Fastly, a content delivery network (CDN) provider.
Sites affected included Amazon, Hulu, The New York Times, CNN, the Guardian, Bloomberg News, The Financial Times and the Verge. Also affected were the Reddit, Pinterest and Twitch platforms.
In a post on the Fastly blog on the day of the outage, Nick Rockwell, the company’s senior VP of engineering and infrastructure, stated that a bug was introduced by the company’s own developers by mistake in a software update, and that bug was triggered when a customer modified a CDN configuration, which is a routine procedure.
“We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95% of our network was operating as normal,” stated Rockwell, who was apologetic. “This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them.”
The damage radius was wide, causing alarm until the cause of the outage was made known and service started to be restored.
‘Cascading Failure’ Results from Bug in a Software Update
Complex cloud-based systems with many dependencies pose risks, especially when things go wrong. “You can end up with these cascading failures,” stated Christopher Meiklejohn, a PhD student at Carnegie Mellon’s Institute for Software Research, in an account from Vox. “They’re difficult to debug. They’re stressful and difficult to resolve. And they can be very difficult to detect early on when you’re thinking about making that change, because the systems are so complex, and they involve so many moving parts.”
The vast systems of CDNs like Fastly, which is one of many, can involve thousands of servers deployed around the world, Meiklejohn stated, making it more likely an outage will be widespread if an error is introduced in the core software. The fact the bug was missed by Fastly’s quality control process is embarrassing for the company. “We’ll figure out why we didn’t detect the bug during our software quality assurance and testing processes,” Rockwell stated in his post.
The Vox account likened the Fastly outage to one in 2011 when an Amazon cloud computing system, Elastic Block Store, crashed and took Reddit, Quora and Foursquare offline. After the incident, Amazon stated that one of its engineers inadvertently caused a technical problem that traveled throughout its systems and caused the outage.
The Fastly outage was referred to as an “object lesson in internet fallibility” in an account in The Financial Times. The writer of the account stated, “The failure is a reminder that ‘bugs’ lie buried in all new software programs. Maybe artificial intelligence will one day be able to anticipate and fix all the situations in which a piece of software can fail.”
CDNs move content closer to users, which improves response times. CDN services include web caching, request routing and server-load balancing, all aimed at reducing load times and improving website performance, according to an account from G2, a site that guides users in the selection of software and services.
Companies that use CDNs include online video streaming providers and e-commerce companies whose services are adversely affected by poor performance. CDN services are often used in conjunction with website hosting services to optimize content delivery speeds.
Customers have many options for which CDN to employ. G2 listed over 100 CDNs in its account. Fastly was in the top 10, which also included Cloudflare, CloudFront, KeyCDN, Microsoft Azure CDN and Google Cloud CDN.
Companies with Multiple CDNs Were Able to Shift Workloads
Some Fastly customers were able to minimize the impact of the outage by shifting workloads to alternate providers, according to an account from ThousandEyes, a network intelligence company. The CDNs provide distributed local delivery, without which streaming media services would not be able to provide high quality digital experiences, for example.
Most CDNs today offer advanced security functionality and are able to block common malicious traffic as well as large-scale denial of service attacks. Fundamentally, CDNs perform two functions: they deliver content from their edge nodes to end users, and they fetch dynamic content from the site origin to deliver at the edge, according to the ThousandEyes account.
Many popular high-volume sites use more than one CDN provider to deliver content to users, primarily for redundancy but also for optimizing performance. This is done for example by load balancing user requests across multiple CDNs.
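As a rough illustration of what load balancing across several CDNs can look like, the Python sketch below picks a provider for each request in proportion to a configured weight and skips any provider marked unhealthy. It is a minimal sketch, not a description of how any site named in this article works: the provider names, the weights, and the `Cdn` and `pick_cdn` names are all hypothetical, and real deployments usually make this decision in DNS or in a traffic-steering service rather than in application code.

```python
import random
from dataclasses import dataclass


@dataclass
class Cdn:
    name: str             # hypothetical provider label
    weight: float         # relative share of traffic when healthy
    healthy: bool = True  # assumed to be set by an external health check


def pick_cdn(cdns, rng=random):
    """Pick a CDN for one request, weighted across healthy providers only."""
    candidates = [c for c in cdns if c.healthy and c.weight > 0]
    if not candidates:
        return None  # nothing usable; the caller must fall back elsewhere
    weights = [c.weight for c in candidates]
    # random.choices renormalizes the weights over the remaining candidates.
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical three-provider setup; one provider is then marked as down.
cdns = [Cdn("cdn-a", 0.5), Cdn("cdn-b", 0.3), Cdn("cdn-c", 0.2)]
cdns[0].healthy = False
choice = pick_cdn(cdns)  # only "cdn-b" or "cdn-c" can be returned now
```

Once a provider is marked unhealthy, its share of traffic is spread across the remaining providers in proportion to their weights, which is the redundancy benefit described above.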
“How a site or application owner chooses to architect its content delivery can determine the severity of impact of an outage like the one Fastly experienced,” stated the account’s author, Angelique Medina, Director of Product Marketing for ThousandEyes. “Some of Fastly’s customers had resilient delivery architectures or they were able to take action to mitigate the impact of the incident—leading to very different outcomes for their users,” she noted.
The company examined the experience of four companies in detail. The New York Times and Reddit each used Fastly’s service as the sole CDN for their primary domains, but the two firms had different experiences. Beginning at 9:50 UTC (5:50 am ET), Reddit was unreachable from around the globe; service was restored about an hour later.
The New York Times, in contrast, temporarily redirected users to the site’s origin servers hosted on Google Cloud Platform, reducing the downtime of its service for users. The beginning of the outage was similar to the experience of Reddit, but 40 minutes into the outage, the site’s availability “significantly increased,” well before Fastly implemented a fix. By 10:50 UTC, no Fastly servers were in the delivery path for the NYT.
After Fastly implemented its fix, just before 10:50 UTC, the NYTimes users were redirected back to the Fastly servers. By 11:30 UTC, the site was returned to its pre-outage state.
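The account does not say how the Times made the switch to its origin servers and back. The Python sketch below shows one common way such a decision could be automated, under stated assumptions: a hypothetical `FailoverRouter` class bypasses the CDN after a run of failed health checks and returns to it only after a run of successful ones, so a single stray result does not flip traffic back and forth. The class name and the thresholds are illustrative, not taken from any real system.

```python
class FailoverRouter:
    """Send traffic to the CDN unless recent health checks say it is down."""

    def __init__(self, fail_after=3, recover_after=3):
        self.fail_after = fail_after        # failed checks before bypassing the CDN
        self.recover_after = recover_after  # good checks before returning to it
        self.using_origin = False
        self._streak = 0  # consecutive checks that contradict the current state

    def record_check(self, cdn_ok):
        """Record one health check result and return the current target."""
        # A check contradicts the current state if the CDN failed while in
        # use, or succeeded while traffic is being sent to the origin.
        contradicts = cdn_ok if self.using_origin else not cdn_ok
        self._streak = self._streak + 1 if contradicts else 0
        limit = self.recover_after if self.using_origin else self.fail_after
        if self._streak >= limit:
            self.using_origin = not self.using_origin
            self._streak = 0
        return "origin" if self.using_origin else "cdn"


# Hypothetical sequence of health checks: an outage, one stray success,
# then a sustained recovery.
router = FailoverRouter(fail_after=2, recover_after=2)
checks = [True, False, False, True, False, True, True]
targets = [router.record_check(ok) for ok in checks]
```

Serving directly from the origin gives up the caching and proximity benefits of the CDN, so a design like this treats the origin as a temporary fallback rather than a permanent path.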
Amazon uses three CDNs to deliver its site, load balancing traffic across them to deliver the best possible experience to its users. Amazon has its own CDN service, CloudFront, that is part of its AWS offerings. Amazon also uses Akamai and Fastly to host its site.
An example from one vantage point showed it targeting Amazon’s site and being directed to a Fastly server just after 8:00 UTC. A few minutes later, it was directed to an Akamai server, and less than 10 minutes later it was switched over to an Amazon server. “This active allocation of users across multiple CDN services is part of normal operations for Amazon,” Medina stated.
Amazon eventually steered users to site components hosted by its own CDN and others, such as Akamai and EdgeCast. By approximately 10:40 UTC, site loading issues had been resolved for most Amazon users.