Telstra’s cloud computing suffers 24 hour outage


news Telstra has confirmed that it suffered a major outage in its high-end corporate cloud computing platform last week that left a number of its most high-profile customers without some of their services for a period as long as 24 hours.

According to information provided to Delimiter, the company’s Exhibition St datacentre in the Melbourne central business district suffered a severity one outage starting on Monday night last week at about 7pm Eastern Standard Time. The issue, which related to the facility’s storage layer, knocked offline services belonging to a number of major customers, reputedly including VISY, Real Insurance, Hollard Financial Services, Oz Minerals and others.

Responding to the issue this week, a spokesperson for the telco confirmed it had suffered an outage on its cloud computing infrastructure. “Last week we had an intermittent service outage on our cloud platform that affected a small number, around 20, of our business customers,” the spokesperson said. “The issue started on Monday 25 March when we identified a failure in the data storage equipment that supported the customers that we affected.”

“When the failure was identified we immediately engaged our storage partner and started restoring services. By Tuesday (26 March) afternoon the majority of services had been restored, though restoration activities for a small number of customers continued into Wednesday. We continue to closely monitor the dedicated hosting services of all customers affected by this issue and apologise for the impact on their services.”

The news has the potential to knock the facility’s uptime capability almost out of the top high-availability tables for IT infrastructure. Under commonly used guidelines, technology services are classified by the amount of downtime they suffer per year. If substantial parts of Telstra’s cloud computing infrastructure were offline for the majority of one day last week, the company’s cloud platform may no longer be able to be classed as having uptime above 99.9 percent (‘three nines’), as that classification allows no more than 8.76 hours of downtime per year.
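To make the arithmetic behind the 'three nines' figure concrete, here is a minimal sketch in Python. It assumes a 365-day year and treats the reported outage as a single 24-hour event with no other downtime in the year; the function names are illustrative only and do not come from Telstra or any standard.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Maximum yearly downtime, in hours, permitted at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def yearly_availability_percent(downtime_hours: float) -> float:
    """Availability over one year, given the total hours of downtime."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


# Downtime budget for 'three nines' (99.9 percent): about 8.76 hours.
three_nines_budget = allowed_downtime_hours(99.9)

# Assumption: one 24-hour outage and nothing else all year.
after_outage = yearly_availability_percent(24)

print(f"Three nines budget: {three_nines_budget:.2f} hours per year")
print(f"Availability after a 24-hour outage: {after_outage:.3f} percent")
```

Under those assumptions, a customer who was down for the full 24 hours would see roughly 99.73 percent availability for the year, which is below the three nines threshold. Customers whose services were restored sooner would fare better.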

It is common for enterprise cloud computing platforms to enjoy uptime higher than this figure. For example, cloud computing vendor Salesforce.com states on its website that its Force.com platform has had a proven 99.9+ percent uptime “for years”. The company publishes uptime statistics on its site.

The news comes as Telstra continues to push the case that its cloud computing platform reflects a sizable revenue opportunity for it going forward. In mid-2011 the company stated that it would spend $800 million in the space to develop its infrastructure and target new customers.

At the time, the telco said the $800 million spend would go on a range of areas over the succeeding five years, but would especially be focused on the construction of a new Melbourne datacentre, slated to go live this year, which would bolster Telstra’s hosting capability by more than 40 percent, with an extra 2000 square metres of space.

The money was also to go on modernising existing Telstra datacentres, expanding the range of enterprise applications the company provides, building a new integrated online account management portal, increasing the automation of utility computing services, and enhancing the capabilities of T-Suite. The company’s key partners for the next phase of its strategy were to be Cisco on the hardware front, VMware for virtualization, Accenture for systems integration and Microsoft for software.

In December last year, Telstra announced it would construct four new datacentres to meet demand.

Paul Geason, Group Managing Director, Telstra Enterprise and Government, said Telstra was committed to providing localised infrastructure to cater for the needs of its customers around the country. “We continue to see tremendous growth in our cloud business with positive customer take up of the technology across a range of industries,” Geason said at the time. “With many organisations moving into the cloud, the feedback has been clear – the option to use local data centres is important either because their applications are sensitive to latency or they require data to be hosted within their state.”

opinion/analysis
Look, I’m sure not all of Telstra’s cloud computing infrastructure went down, and that many of the services being provided to customers remained up and functioning fine. In this situation, Telstra would throw considerable resources at the issue to get it fixed as soon as possible.

However, to enterprise IT customers — and Telstra has some big manufacturers and so on in its customer ranks for cloud computing — 24 hours is a very, very long outage for this kind of enterprise-grade cloud computing. Frankly, I would be very surprised if Telstra hadn’t broken some of its service level agreements here, and I wouldn’t be surprised to hear that there’s compensation in the works here and there.

You cannot … you absolutely cannot, as an enterprise cloud computing player, let your services stay down for a whole day. That’s the kind of situation which encourages customers to start looking elsewhere for these kinds of services. These sorts of outages aren’t supposed to happen with a provider as large as Telstra — that’s why you go with Telstra in the first place.

Image credits: Telstra

19 COMMENTS

  1. If the outage was caused by a file server going down then I would not describe Telstra’s network as cloud computing. If it was cloud computing then the data would have been spread between multiple machines with multiple redundancy, same as Google and Amazon.

  2. That’s what you get for putting your eggs in the “cloud” basket.

    Presumably the storage devices that failed were EMC, as they have a reputation for taking down customers’ environments. I can recall 2 similar stories with EMC SAN failures in the past couple of years.

  3. I really laugh at this stuff. Why does absolute crap like free-to-air television, which uses sophisticated SYNCHRONOUS data networks, rarely go down for as little as 5 minutes every 5 years? Yet important stuff like banking and now “the cloud”, which is “the business”, goes down for 24 hours or more!

  4. Hmmm … s**t has always happened and will continue to do so I guess … as much as we would wish it otherwise. Cloud skeptics tend to jump on these outage stories with glee, but the organisations that are actually involved will take a more measured approach.

    For Telstra, this is a ‘rich learning experience’ … which will reinforce their commitment (and investment prioritization) to ensuring that the experience is not repeated. For the customer organisations affected, this will reinforce the need for appropriate risk management and business continuity strategies for their cloud sourced services. The net result from a strategic perspective is better for both parties … despite the short term pain.

    The reality is that cloud services adoption is all about trade-offs. Of course the best option is gold plated fully in-house world-class assets, applications and staff … but this is just not affordable or sustainable for many organisations … so cloud services have a valid place in the ICT mix. Occasional outages are just a fact of life, and don’t change this reality IMHO … unless of course they become a pattern and are evidence that the wrong cloud services provider has been chosen or cloud services is not an appropriate sourcing option for a particular workload or application.

    • Indeed, the concern I have regarding cloud, is that often the companies making the decisions don’t truly understand the potential risk. As with most outsourcing of services, it is driven by accountants looking at the dollars.

      I see no reason why the cloud isn’t a viable option. But you have to be aware that you are giving up an element of control and relying on a service provider. And the key thing with service providers is they don’t really care about your business. So if you don’t build appropriate SLAs and redundancies then you are shooting yourself in the foot. And most organisations I have been in don’t build appropriate SLAs.

      This is just an example of the risk. Like all things don’t throw the baby out with the bathwater, but take it as a warning, and make sure the cloud is appropriate to your core business.

      • Hi Woolfe,

        Responding to your comment “the key thing with service providers, is they don’t really care about your business” …

        This is true of outsourcing and managed service deals because you require the service provider to provide dedicated services customized for your needs … hence if they don’t ‘care about your business’ enough the service will be poor.

        I actually think, however, that one of the value propositions of cloud services, as massive scale shared services, is that the service provider doesn’t actually need to (and shouldn’t) “care about your business” in the hand-crafted, personalized, way of traditional outsourcing or managed services.

        The whole deal is premised on the fact that the cloud service provider has constructed an industrialized facility that delivers a high quality standardized (but somewhat configurable) service offering. The provider should care a great deal about THEIR BUSINESS (i.e. about quality, reliability, sustainability, relevance, functionality etc. of the service offering) and of course should care about the collective needs and satisfaction of customers … but they are not in the business of trying to care for each customer individually in a traditional dedicated/customized service provider manner.

        We can’t have it both ways … the benefits of cloud services stem from their industrialization … so we need to learn how to be intelligent consumers of generic industrialized services in order to become happy with this model … and to apply it to appropriate use cases.

        If you buy dedicated/customized managed services then you need a trusted service provider that understands your individual needs and ‘cares’ about you … and you need to ‘work the relationship’. When things go badly wrong you phone the lawyers because you are locked in to a long term contract.

        If you buy cloud services then you need a service provider that offers high quality standardized shared service that leverages economies of scale in development and operations across a large customer base … and cares deeply about their operational performance metrics for all customers as a group … but may not really even know much about you as an individual customer at all. When things go wrong you rely on the ‘wrath of the crowd’ to discipline the provider … and if that fails to improve the situation then you may need to terminate the service and exercise your tested plan B or find a replacement service.

        This market is still maturing, and many so-called cloud services are really just an evolved form of managed services (in terms of both the capabilities of the providers and the buying behaviors of the customers) … so we are on a journey. More trials and tribulations are sure to follow …

        • Oh I agree with you completely. That is essentially what I meant. The Consumer of the Service needs to ensure that they understand what it is they are getting into. So not just the up front cost benefits, but also the potential pitfalls etc.
          And as well as that, as you say they need to work with and manage the relationship with the service provider.

          But if your business relies upon an element being available, then that needs to be understood and taken into account when looking at cloud based services. Because if said services are unable or unwilling to ensure your service (and by that I mean keeping it going, not simply paying rebates if it goes down) is always on, then it may be that the cloud is not appropriate for your business yet.
          As you say it is still maturing, so they are really targeting the majority 80% of business at the moment, the more complex 20% exceptional business will come as additional controls and guarantees get bolted on.

  5. 24 hours? Try 6 days. Rebuilds and verifications rendered the Hitachi platform unusable until the following Monday.

  6. In response to Steve’s comment, your words are too kind. For customers choosing to pay up with Telstra with all its marketing jingles and premium rates, there will be an expectation of premier service and reliability.

    From my one on one experience with Telstra Cloud services, there is a gap between what is being marketed and the real world practice.

    The negative comments are more than valid, customers are not dealing with a geek who just setup a cloud service with little experience.

    • No, they jumped out and completely outsourced solutions to vendors at great expense to get it up and running as quickly as possible, without sitting down and thinking it out and choosing to skill I.T. locally.

      The premium is the speed with which they delivered their projects, not the quality of the service.

    • Hi James, yes, fair enough. The customers concerned would seem to have legitimate reason to be very annoyed with Telstra if the reporting is accurate. I suppose I was talking more in generalities about what this means for cloud services as a way of sourcing ICT capabilities.

      The problem we have in Australia is that even the largest cloud service providers are still scrambling up the learning curve in terms of their ability to build/implement/operate large scale high reliability cloud services and to turn around their existing process, service, marketing and sales models. Occasional dropping of the ball is just a reality … hence …

      Caveat emptor … don’t forget the seven basics of cloud services procurement:

      (1) Contract – make sure it covers off your business risk exposures.
      (2) SLA – make sure it meets your business needs.
      (3) Information categorisation and management – make sure you understand what data can and cannot be in a cloud service and how these decisions are made.
      (4) Data prenup – make sure you know how to get your data back … with local periodic replication/backup … daily if needs be.
      (5) Tested Plan B – make sure you know what to do if the cloud service goes down … and have a tested business continuity strategy if needs be.
      (6) Staff training – make sure your people understand how to use the service, how to cope with an outage and what their obligations are in regard to security practices etc.
      (7) Quality certification – make sure the provider has the requisite quality & security certifications.

      Beyond that … given that there is an overall business value benefit/risk proposition in favor of cloud adoption … cross your fingers!

  7. There will always be problems with every single vendor out there. Even them NASA guys have problems and I think their budget would be all of the providers put together and then some more.

    What becomes important and probably one of the biggest things to look at these days would be the amount of time it takes to recover services after an event.

    There was another provider the other day who had a storage issue. They seemed to take twice as long as Telstra to get their systems back up and running.
