The Inside Track: What we learned from the Sydney AWS outage

This article was originally published for Delimiter Members only. In late September 2016, Delimiter ceased publishing new articles. Because of a number of operational and other factors associated with this decision, we subsequently withdrew membership articles from publication. If you would like to see a copy of this article, please contact Delimiter directly with your request. Requests by Delimiter Members will be granted. We will consider all other requests on their merits.


13 COMMENTS

  1. Sydney Zone 2 Eastern Creek. Is that still happening?

    They need to run some kind of offline test for their backup power perhaps.

    But this is nothing like faulty copper falling over for days and weeks because of rain.

  2. What we really learnt is that the people running on AWS in Sydney are either cheap or incompetent. It’s not that hard to spin up instances in the other availability zones (AZs), as only AZ2 went down, and ensure that you keep running even when one is taken out. You double your cost, that’s true, but if you are one of the companies mentioned you probably need the extra instances to cope with the load anyway.

    It’s not that difficult. The only case where this is a problem is if you are backed by SQL Server running on RDS, and Amazon is working on fixing that as well now that they have a third AZ. Even that can be solved by running your own instances with SQL Server and doing it yourself (see the sketch after this thread).

    In short there is NO excuse for Domain, Channel Nine, Domino’s, Foxtel Play or Stan to have gone down. None. They should be taking a good hard look at their IT operations teams.

    • To be fair, it may well be that their IT teams provided competent recommendations to management, who vetoed the proposal – ‘100% additional cost to rule out that 0.01% chance of downtime? I don’t think we’ll really need that, do you?’ goes nearly every such discussion I’ve had with senior management ever…
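For readers wanting to picture what “spin up instances in the other availability zones” looks like in practice, here is a minimal, illustrative boto3 sketch: a classic ELB (the 2016-era load balancer) plus an Auto Scaling group spread across all three Sydney AZs, so losing one zone leaves capacity in the other two. The names (my-app-elb, my-app-asg, my-app-lc), the AMI ID and the instance sizes are hypothetical placeholders, and a real VPC deployment would attach subnets rather than bare AZ names.

```python
# Illustrative only: resource names, AMI ID and sizes are placeholders.
import boto3

REGION = "ap-southeast-2"
AZS = ["ap-southeast-2a", "ap-southeast-2b", "ap-southeast-2c"]

elb = boto3.client("elb", region_name=REGION)            # classic ELB API (2016 era)
autoscaling = boto3.client("autoscaling", region_name=REGION)

# 1. A load balancer enabled in every Sydney AZ.
#    (In a VPC you would pass Subnets=[...] instead of AvailabilityZones.)
elb.create_load_balancer(
    LoadBalancerName="my-app-elb",
    Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80,
                "InstanceProtocol": "HTTP", "InstancePort": 80}],
    AvailabilityZones=AZS,
)

# Send requests to healthy instances in any AZ, not just the local one.
elb.modify_load_balancer_attributes(
    LoadBalancerName="my-app-elb",
    LoadBalancerAttributes={"CrossZoneLoadBalancing": {"Enabled": True}},
)

# 2. An Auto Scaling group that keeps at least two instances per AZ,
#    so a single-AZ failure never takes the whole fleet down.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="my-app-lc",
    ImageId="ami-xxxxxxxx",          # placeholder AMI
    InstanceType="t2.medium",
)
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="my-app-asg",
    LaunchConfigurationName="my-app-lc",
    MinSize=len(AZS) * 2,
    MaxSize=len(AZS) * 6,
    AvailabilityZones=AZS,
    LoadBalancerNames=["my-app-elb"],
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)
```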

  3. Stan. Would they be using CloudFront? If so, Zone 2 would help there. If there is only one CloudFront Sydney zone, and they chose not to replicate content to other CloudFront regions, then of course that is a problem.

  4. Amazon have clearly made some rookie mistakes in their design. The Tier 3 spec DC I used to manage never had issues with voltage sags: as soon as the building power management system detected a sag below a certain threshold and above a certain duration, it fired up the generators (roughly a 60-second timeline in total).

  5. There’s been little mention of the major problems with Elastic Load Balancers registering instances.

    The downtime that my systems experienced was due to this problem, even with multiple AZs in use. I don’t think I was alone in experiencing this either :/
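If you suspect the registration problem described above, the classic-ELB API can at least tell you which registered instances the load balancer currently considers healthy. A small illustrative sketch, with the load balancer name as a placeholder:

```python
# Illustrative only: report the health of every instance registered with a
# classic ELB and flag anything that is not InService.
import boto3

elb = boto3.client("elb", region_name="ap-southeast-2")

health = elb.describe_instance_health(LoadBalancerName="my-app-elb")
for state in health["InstanceStates"]:
    flag = "OK" if state["State"] == "InService" else "!!"
    print(f"{flag} {state['InstanceId']}: {state['State']} ({state['Description']})")
```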

  6. Recommended read:

    http://www.theregister.co.uk/2016/06/09/aws_blames_latent_bug_for_prolonging_sydney_ec2_outage

    For what it’s worth: the disastrous shape of wiring/cabling/fibre infrastructure in this country can’t make the job of orgs like AWS any easier, yet it’s still weird that a latent bug like this wouldn’t be detected in full-on fail-over testing.

    Still: how many organisations with their own local hosting would have done better? How many old-school hosting shops? In my sad experience – including first hand as a customer of some of the prior or current Australian top names – very few of them.

    Thing is, I expect AWS to actually throw resources at it to ensure it doesn’t happen again. That’s another thing that, in my experience, self-hosted and old-school hosted orgs still wouldn’t do. Oh, lip service? Sure! Real action? No.

    No, I don’t work for AWS.

  7. So did ELBs fail to pass traffic to instances already up and registered in other AZs? Or did people just have instances in the one AZ and then try to bring up new ones during the outage? If the former, then things are very bad. If the latter… Who cares?

    • I know it’s only one data point and I could have been lucky, but I had a detailed look through all the logs: every instance I had running stayed up, and auto scaling behind ELB did a good job of spawning new instances during the event. In fact I saw significant additional traffic, due to a lot of people shooting video of the weather as it happened as well as the aftermath, but scaling worked well in both ap-southeast-2a & b, not only for my regular web instances, but more importantly for my video encode workers, which come and go quickly depending on load.

      • Awesome. Thanks for the info.

        Well then, get with it, everyone else. The lesson learned from this is not that AWS needs to solidify their AZs – they’re telling us very clearly that the fix is to be properly multi-AZ. Expect more of the same. Plan for failure!
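On the “be properly multi-AZ” point, here is a quick illustrative check of whether a classic ELB is actually enabled in more than one AZ, and where its registered instances really live. The load balancer name is a placeholder.

```python
# Illustrative only: confirm a classic ELB is enabled in more than one AZ
# and count registered instances per AZ.
from collections import Counter
import boto3

REGION = "ap-southeast-2"
elb = boto3.client("elb", region_name=REGION)
ec2 = boto3.client("ec2", region_name=REGION)

lb = elb.describe_load_balancers(LoadBalancerNames=["my-app-elb"])["LoadBalancerDescriptions"][0]
print("ELB enabled in AZs:", lb["AvailabilityZones"])

instance_ids = [i["InstanceId"] for i in lb["Instances"]]
if instance_ids:
    reservations = ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]
    per_az = Counter(
        inst["Placement"]["AvailabilityZone"]
        for r in reservations
        for inst in r["Instances"]
    )
    print("Registered instances per AZ:", dict(per_az))
    if len(per_az) < 2:
        print("WARNING: every registered instance is in a single AZ")
```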

Comments are closed.