Amazon UPS design at fault in Sydney outage

As you may have noticed, Amazon Web Services is not exactly having a fantastic week in Australia.

The US cloud computing giant has pretty much had a dream run in Australia up until now, with its local datacentre infrastructure enjoying strong uptime, and many big-name customers defecting from their own datacentres to Amazon’s cloud.

But on Sunday that dream run came to an end, with wild weather in Sydney knocking Amazon’s datacentre offline and many of those same customers seeing their infrastructure grind to a screeching halt.

Today Amazon has published an extensive statement explaining how things went so drastically wrong, and what the company is going to do to fix the issue. As it turns out, it’s Amazon’s fault this happened. A key paragraph:

“The specific signature of this weekend’s utility power failure resulted in an unusually long voltage sag (rather than a complete outage). Because of the unexpected nature of this voltage sag, a set of breakers responsible for isolating the [diesel rotary uninterruptible power supplies] from utility power failed to open quickly enough.”

In short, what Amazon is saying here is that because the weekend’s power failure in the broader electricity grid was not a clean break but a prolonged voltage “sag”, the breakers that should have isolated its equipment from the degraded utility supply didn’t open quickly enough, and its backup generators didn’t come online in time to keep its customers’ IT infrastructure running.

Amazon is now fixing the problem:

“… it is apparent that we need to enhance this particular design to prevent similar power sags from affecting our power delivery infrastructure. In order to prevent a recurrence of this correlated power delivery line-up failure, we are adding additional breakers to assure that we more quickly break connections to degraded utility power to allow our generators to activate before the UPS systems are depleted.”

I’ll have more to say about this in a detailed analysis shortly, but for now I recommend people read Amazon’s full statement on this issue. It’s detailed and gives the behind-the-scenes story of what happened here. It’s precisely the kind of detailed statement and apology that major IT vendors such as AWS need to provide to their customers … when they screw up as badly as this.

Image: Andrew Mager, Creative Commons

3 COMMENTS

  1. It’s been a good run. Having the second data centre up at Eastern Creek could help also. It’s not like rain makes our faulty copper internet fall over for days and weeks.

  2. A sag, they say? Most people call them brownouts, and they should be expected and handled by UPS systems.

    • The UPS units themselves caught the brownout, but the gensets that were supposed to come on before the UPS batteries died didn’t.

      They should have, of course; that’s the whole point.
