The Commonwealth Bank’s wide-ranging outage also took down its customer relationship management platform CommSee, one of its main unions has revealed, further illustrating how extensive the technology-related problems suffered by the bank over the past week have truly been.
On Thursday last week, according to sources, a patch was issued using Microsoft’s System Center Configuration Manager (SCCM) remote deployment tool within CommBank. It appears the patch was intended to be distributed to a small number of the bank’s desktop PCs as a disaster recovery exercise only, but it was mistakenly applied to a much wider swathe of the bank’s desktop and server fleet. Late last week, sources said that some 9,000 desktop PCs, hundreds of mid-range Windows servers (sources said as many as 490) and even iPads had been rendered unusable due to software corruption issues associated with the patch, with HP (one of the bank’s IT outsourcing partners) and the bank’s internal team scrambling to restore systems.
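To illustrate the class of safeguard at issue here: a purely hypothetical sketch (nothing below reflects CommBank’s or HP’s actual tooling, and the function and host names are invented) of a pre-deployment guardrail that refuses to approve a rollout when the target list is far larger than the operator expects — the kind of check that can catch a deployment scoped to the whole fleet instead of a handful of test machines.

```python
# Hypothetical guardrail sketch: compare a deployment's actual target list
# against the operator's stated expected scope before approving it.

def approve_deployment(target_hosts, expected_max, sample_size=5):
    """Return (approved, reason). Refuse if the target list exceeds expected_max."""
    actual = len(target_hosts)
    if actual > expected_max:
        # Show a small sample of the unexpected targets to aid diagnosis.
        sample = ", ".join(sorted(target_hosts)[:sample_size])
        return False, (f"target has {actual} hosts, expected at most "
                       f"{expected_max}; sample: {sample}")
    return True, f"approved for {actual} hosts"

# A DR exercise scoped to a couple of test PCs passes the check...
ok, why = approve_deployment({"dr-test-01", "dr-test-02"}, expected_max=10)

# ...while a collection that accidentally matched thousands of machines does not.
bad, why_bad = approve_deployment({f"pc-{i:05d}" for i in range(9000)},
                                  expected_max=10)
```

In a real deployment tool the target list would come from a dynamic collection query, which is precisely why the membership can silently balloon beyond what the operator intended.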
The issues had not previously been reported to have affected any of the bank’s top-level technology systems. However, in a statement issued this week, the Financial Services Union revealed the crash also took the bank’s CommSee customer relationship management system offline. CommSee is a critical system for CommBank. Developed in the early 2000s and deployed through the middle of that decade, the system enables the bank’s staff to get a single view of each customer’s information, drawn from various internal banking resources.
“While the efforts of staff throughout the bank meant that most customers would have been oblivious to the problems, the bank’s reliance on the system meant that most staff were working without access to critical information and facilities,” the FSU wrote. “It was effectively impossible to complete much of the work.”
“The clean-up and backlog caused by the crash has meant that it would have been impossible for many staff to complete planned work for up to five days and it will now be difficult to catch up over the coming weeks. On this basis FSU believes that it would be unreasonable to hold staff accountable for targets set for July, where between three and five of the last working days of the month were lost and August, where an unclear number of days will be spent catching up on July’s unfinished work.”
The FSU noted in its statement that it had met with CommBank yesterday, and had called on the bank to suspend employee targets for July and August in the wake of what it called the “disastrous” crash. In response to its request, the FSU wrote, representatives of the bank acknowledged the problems caused by the system crash and said that an announcement would be made soon about targets. The bank said, according to the FSU, that sales meetings should not have taken place on Monday 30 July.
The FSU said it had been informed that the broader outage had occurred because of the actions of “an outsourced provider”. Delimiter believes that the rogue SCCM deployment originated within an HP team in New Zealand, but neither CommBank nor HP has publicly confirmed that allegation. An HP spokesperson declined to comment on the issue today, while a Commonwealth Bank spokesperson has not yet returned a call requesting further comment this afternoon.
In a statement last week, the bank didn’t provide any details of what it said was “a problem with an internal software upgrade”, and played down the issue. It noted that while the vast majority of its over 1,000 branches were offering full services and its ATM, Internet and phone banking services were unaffected, about 95 branches were only offering limited services, such as access to automatic teller machines.
“Our branch staff in each affected branch are available as usual to assist our customers with their enquiries,” the bank said in a statement issued on Friday. “Customers may experience some increased wait times in some of our branches and when calling our call centres. Our priority is on restoring all services as quickly as possible and we apologise for the inconvenience.”
“It is unclear whether the problems would have been avoided if the work had remained in-house but the problem underpins the FSU criticism of outsourcing where the bank loses direct control over end to end processes,” the union wrote, pointing out that CBA had recently announced its IT help desk facilities at Sydney Olympic Park would be outsourced to HP, resulting in 50 CBA jobs being lost.
HP is believed to have allocated additional resources in an emergency effort to re-image the servers and desktop PCs from scratch with the bank’s standard operating environment and other platforms where appropriate, with the bank lodging a ‘P1’ highest-priority incident notice with the company. Internally, some staff at HP were told to throw every resource possible at the situation. CommBank’s own backup and restore teams were also believed to have been devoting resources to the issue wholesale last week and over the past few days.
With the union statement issued this week and other information we have received, several new facts have emerged with respect to CommBank’s outage last week (which is still affecting the bank’s operations to some degree).
Firstly, it appears pretty clear now that CommBank’s issues originated within HP, and likely within one of the company’s New Zealand teams. I don’t think the FSU’s statement linking this particular issue to the issue of IT outsourcing in general is that legitimate (this sort of human error could just as easily have occurred if CommBank’s IT operations were completely insourced), but it does seem clear that this was a stuffup on the part of a major CommBank IT services provider. There will be questions raised about the extent to which governance procedures were in place, both within that provider and within CommBank itself, to oversee the services provided.
Secondly, we have the new and somewhat disturbing information that not only were thousands of the bank’s PCs and servers taken down due to the outage, but that one of its main critical systems was also affected. For CommSee to be taken down in CommBank is no laughing matter. This is a “heads will roll” kind of situation. CommSee is the kind of system which thousands upon thousands of CommBank staff rely on daily to get basic stuff done. As the union says, if it doesn’t work, the bank doesn’t work, and that’s a huge issue.
What all of this adds up to is the sort of situation which will likely have had CommBank’s chief executive Ian Narev and its chief information officer Michael Harte stampeding around its head office in Sydney’s Commonwealth Bank Place with looks of burning fury writ large upon their faces. This sort of thing isn’t supposed to happen at CommBank anymore — with all of the huge improvements CommBank has made to its internal IT over the past decade, this is one bank which is supposed to be beyond this kind of outage.
To think that a simple configuration mistake by one or a small handful of staff at an outsourcer could bring down so many critical pieces of IT infrastructure at one of Australia’s largest IT shops is just staggering. It starkly illustrates that for all of their advances over the past decade in terms of reliability, capability and service assurance, Australia’s banks really do have a long way to go to get their IT systems to the state of stability which the next few decades will demand of them.
Because if this kind of thing could happen to CommBank, which in almost every way is the technology leader in Australia’s financial services sector, then it could also happen to anyone else — at almost any time.
For me personally, this issue is also a reminder that as much as the global technology industry believes itself to be mature in many areas, the truth is that it’s not. The truth is that we are still right at the dawn of humanity’s understanding of technology in general, and how to keep technology working all the time, come what may. Today may be a good time to reflect on the fact that we’ve only had modern computer-based systems as we know them for little over half a century. No doubt it will take another half a century or more until they become stable enough to be truly described as “reliable”.
In 2012, you can plough billions of dollars into technology infrastructure if you so desire. But that doesn’t mean that a human can’t take much of it down with the accidental flick of a switch.