CBA outage took down CommSee


news

The Commonwealth Bank’s wide-ranging outage also took down its customer relationship management platform CommSee, one of its main unions has revealed, further illustrating just how extensive the technology-related problems suffered by the bank over the past week have been.

On Thursday last week, according to sources, a patch was issued within CommBank using Microsoft’s System Center Configuration Manager (SCCM) remote deployment tool. The patch appears to have been intended for distribution to only a small number of the bank’s desktop PCs as a disaster recovery exercise, but it was mistakenly applied to a much wider swathe of the bank’s desktop and server fleet. Late last week, sources said that some 9,000 desktop PCs, hundreds of mid-range Windows servers (sources put the figure as high as 490) and even iPads had been rendered unusable due to software corruption issues associated with the patch, with HP (one of the bank’s IT outsourcing partners) and the bank’s internal teams scrambling to restore systems.
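
To illustrate the failure mode only (this is a hypothetical sketch, not a description of CommBank’s or HP’s actual tooling, and none of the names, thresholds or hostname conventions below are real SCCM APIs): a deployment pipeline can run a pre-flight check that refuses to push a patch when the target group is far larger than the intended test group, or contains machines of the wrong class.

```python
# Hypothetical pre-flight check for a patch deployment -- illustrative only.
# The thresholds and the "srv"/"pc" hostname convention are assumptions;
# this is not an SCCM API and not CommBank's or HP's actual process.

EXPECTED_TARGETS = 25   # assumed size of the intended disaster-recovery test group
SAFETY_CEILING = 50     # refuse to deploy to any target list larger than this

def approve_deployment(members: list[str], server_prefix: str = "srv") -> bool:
    """Approve a desktop-only test deployment, or abort if the scope looks wrong."""
    if len(members) > SAFETY_CEILING:
        print(f"ABORT: {len(members)} targets found, expected roughly {EXPECTED_TARGETS}.")
        return False
    servers = [m for m in members if m.lower().startswith(server_prefix)]
    if servers:
        print(f"ABORT: {len(servers)} servers present in a desktop-only deployment.")
        return False
    print(f"OK: deploying to {len(members)} desktops.")
    return True

# A mis-scoped collection that has accidentally picked up the whole fleet:
fleet = [f"srv{i:03d}" for i in range(490)] + [f"pc{i:05d}" for i in range(9000)]
assert approve_deployment(fleet) is False   # the guard blocks the push
```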

The issues had not previously been reported to have affected any of the bank’s top-level technology systems. However, in a statement issued this week, the Financial Services Union revealed the crash also took the bank’s CommSee customer relationship management system offline. CommSee is a critical system for CommBank. Developed in the early 2000s and deployed through the middle of that decade, the system gives the bank’s staff a single view of each customer’s information, drawn from various internal banking resources.

“While the efforts of staff throughout the bank meant that most customers would have been oblivious to the problems, the bank’s reliance on the system meant that most staff were working without access to critical information and facilities,” the FSU wrote. “It was effectively impossible to complete much of the work.”

“The clean-up and backlog caused by the crash has meant that it would have been impossible for many staff to complete planned work for up to five days and it will now be difficult to catch up over the coming weeks. On this basis FSU believes that it would be unreasonable to hold staff accountable for targets set for July, where between three and five of the last working days of the month were lost and August, where an unclear number of days will be spent catching up on July’s unfinished work.”

The FSU noted in its statement that it had met with CommBank yesterday, and had called on the bank to suspend employee targets for July and August in the wake of what it called the “disastrous” crash. In response to its request, the FSU wrote, representatives of the bank acknowledged the problems caused by the system crash and said that an announcement would be made soon about targets. The bank said, according to the FSU, that sales meetings should not have taken place on Monday 30 July.

The FSU said it had been informed that the broader outage had occurred because of the actions of “an outsourced provider”. Delimiter believes that the rogue SCCM deployment originated within an HP team in New Zealand, but neither CommBank nor HP has publicly confirmed that allegation. An HP spokesperson declined to comment on the issue today, while a Commonwealth Bank spokesperson had not yet returned a call requesting further comment this afternoon.

In a statement last week, the bank didn’t provide any details of what it said was “a problem with an internal software upgrade”, and played down the issue. It noted that while the vast majority of its more than 1,000 branches were offering full services, and its ATM, internet and phone banking services were unaffected, about 95 branches were only offering limited services, such as access to automatic teller machines.

“Our branch staff in each affected branch are available as usual to assist our customers with their enquiries,” the bank said in a statement issued on Friday. “Customers may experience some increased wait times in some of our branches and when calling our call centres. Our priority is on restoring all services as quickly as possible and we apologise for the inconvenience.”

“It is unclear whether the problems would have been avoided if the work had remained in-house but the problem underpins the FSU criticism of outsourcing where the bank loses direct control over end to end processes,” the union wrote, pointing out that CBA had recently announced its IT help desk facilities at Sydney Olympic Park would be outsourced to HP, resulting in 50 CBA jobs being lost.

HP is believed to have allocated additional resources in an emergency effort to re-image the affected servers and desktop PCs from scratch with the bank’s standard operating environment and other platforms where appropriate, with the bank lodging a ‘P1’ highest-priority incident notice with the company. Internally, some staff at HP were told to throw every possible resource at the situation, and CommBank’s own backup and restore teams were also believed to have been working on the issue wholesale last week and over the past few days.

opinion/analysis
With the union statement issued this week and other bits and pieces of information we have received, we now know several new things about CommBank’s outage last week (which is still affecting the bank’s operations to some degree).

Firstly, it appears pretty clear now that CommBank’s issues originated within HP, and likely within one of the company’s New Zealand teams. I don’t think the FSU’s statement linking this particular incident to the issue of IT outsourcing in general is that legitimate (this sort of human error could just as easily have occurred if CommBank’s IT operations were completely insourced), but it does look like a stuffup on the part of a major CommBank IT services provider. Questions will be raised about what governance procedures were in place, both within that provider and within CommBank itself, to oversee the services provided.

Secondly, we have the new and somewhat disturbing information that not only were thousands of the bank’s PCs and servers taken down by the outage, but that one of its main critical systems was also affected. For CommSee to be taken down at CommBank is no laughing matter. This is a “heads will roll” kind of situation. CommSee is the kind of system which thousands upon thousands of CommBank staff rely on daily to get basic stuff done. As the union says, if it doesn’t work, the bank doesn’t work, and that’s a huge issue.

What all of this adds up to is the sort of situation which will likely have had CommBank’s chief executive Ian Narev and its chief information officer Michael Harte stampeding around its head office in Sydney’s Commonwealth Bank Place with looks of burning fury writ large upon their faces. This sort of thing isn’t supposed to happen at CommBank anymore — with all of the huge improvements CommBank has made to its internal IT over the past decade, this is one bank which is supposed to be beyond this kind of outage.

To think that a simple configuration mistake by one or a small handful of staff at an outsourcer could bring down so many critical pieces of IT infrastructure at one of Australia’s largest IT shops is just staggering. It starkly illustrates that for all of their advances over the past decade in terms of reliability, capability and service assurance, Australia’s banks really do have a long way to go to get their IT systems to the state of stability which the next few decades will demand of them.

Because if this kind of thing could happen to CommBank, which in almost every way is the technology leader in Australia’s financial services sector, then it could also happen to anyone else — at almost any time.

For me personally, this issue is also a reminder that as much as the global technology industry believes itself to be mature in many areas, the truth is that it’s not. The truth is that we are still right at the dawn of humanity’s understanding of technology in general, and how to keep technology working all the time, come what may. Today may be a good time to reflect on the fact that we’ve only had modern computer-based systems as we know them for little over half a century. No doubt it will take another half a century or more until they become stable enough to be truly described as “reliable”.

In 2012, you can plough billions of dollars into technology infrastructure if you so desire. But that doesn’t mean that a human can’t take much of it down with the accidental flick of a switch.

Image credit: megawatts86, Creative Commons

12 COMMENTS

  1. It sounds like someone designed their SCCM collections incorrectly and placed most of their infrastructure inside one collection, which would have force-pushed an advertisement out to all the PCs and servers inside it.

  2. Actually, it’s quite legitimate to blame outsourcing. As a process, outsourcing generally involves replacing experienced bank staff with cheap plug-in staff at the outsourcer.

    Typically some MBA has gone along and said: “this job just needs an SCCM operator. We can get them for $35 an hour.” So the outsourcer puts out an ad that’s more concerned about what version of SCCM the candidate has used.

    Meanwhile the guys who used to do the job in the bank probably had safe procedures that always worked, knew which branches had dodgy PCs, which days were bad for updates (because managers were at golf) and so on. The MBAs lacked the expertise to perceive the real value the old staff provided.

    Almost all the strange bank problems over the past two years have involved outsourcers.

  3. Word has it (from an extremely well-placed source) that HP has flown its most senior executives, including CEO Meg Whitman, in from the US for damage control. One can only assume that CBA may be looking for blood… and that blood could be the cancellation of the impending $700mill HP contract extension!

    • Not surprised. I worked for an IT company that was providing the CBA with a contracted service worth over $50 million. One year into the five-year contract, the CommBank CIO asked our CEO to give him a good reason why he shouldn’t cancel the contract there and then for poor delivery of the service. Needless to say our CEO kicked a lot of butts and salvaged the contract and relationship with the CBA.

      It just goes to show the CBA are big enough and ugly enough to demand the best. And good on them, I like to see the big outsourcers on their knees crying for mercy, from time to time.

  4. I’ve been an SCCM admin for many years, both as a direct-hire employee and as a third-party provider. I’ve seen huge targeting mistakes like this one, and they have often been made by direct hires. (Also keep in mind that when a company outsources, the outsourcer often hires employees from the company that outsourced.)

    This is not just a question of outsourcing. It’s a question of losing focus on the task at hand AND not having enough checks and balances in place to mitigate these types of risks.

    The patch package could have been written with logic to also limit the target audience, rather than relying solely on collection criteria. As well, the servers and workstations should have been on different sites entirely.
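
    A rough illustration of that kind of package-level guard (a Python sketch only; a real SCCM package would more likely use a detection script or PowerShell, and nothing here is actual SCCM, CBA or HP code): the package checks the Windows product type from the registry and refuses to run on servers, regardless of how the deployment collection was scoped.

```python
# Illustrative package-side guard, not real SCCM or CBA/HP code. The registry
# key and ProductType values below are standard Windows ones: "WinNT" means a
# workstation, "ServerNT" or "LanmanNT" means a server or domain controller.
import sys
import winreg  # Windows-only standard library module

def is_workstation() -> bool:
    """Read the Windows product type from the registry."""
    key = winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE,
        r"SYSTEM\CurrentControlSet\Control\ProductOptions",
    )
    product_type, _ = winreg.QueryValueEx(key, "ProductType")
    return product_type == "WinNT"

if not is_workstation():
    print("Refusing to apply a workstation-only patch to a server.")
    sys.exit(1)

print("Workstation detected; applying patch...")
# ... actual patch steps would follow here ...
```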

    • Those are some good comments. There’s another interesting aspect to this report.

      It seems to have become standard practice for unions to publicly trash any company they have been in dispute with. It is surprising that unions with a lot of members at a company appear blind to the fact that such attacks will affect the reputation of the company (as intended), which will reduce the amount of business they do, which will reduce the number of employees they have.

      One of the best (worst) examples of this union foot-shooting is Qantas, which has been under sustained attack from some unions for years. The resulting loss of public confidence and consequent reduction in the size of the company has caused a significant loss of jobs.

      • QANTAS have shot _themselves_ in the foot, repeatedly and deliberately, since acquiring Joyce as CEO. They have refused to work with their own people, preferring to strand many thousands of their own customers. They have given away lucrative overseas routes, made grandiose “plans” which don’t materialize, got rid of many of their most experienced pilots and made really bad choices in fleet purchases. See Crikey’s “Plane Talking” aviation blog for more details.

  5. I wonder how much the hiring freeze, redundancies and culling of contingent workforce in the past six months has impacted HP.

    From personal experience and from talking to others who used to work there, it seems that experienced local engineers / sysadmins have been walked. This puts extra pressure on the poor FTE folk left. I’ve spoken to admins that regularly put in 60-70 hour weeks trying to keep their heads above water, so to speak. Add to that, the loss of accumulated knowledge and having admins thrust into unfamiliar environments…. I’d suggest it is a miracle this kind of thing hasn’t happened sooner (and talk on the inside is that similar, yet obviously smaller, stuff does happen).

    It’s hard to sympathise with big companies who cut their IT staff to save a dollar. If your business is absolutely reliant on your IT systems, why would you trust it in the hands of big global companies like HP?
