THE AUSTRALIAN CENSUS CHALLENGE
By Peter McCallum
We up here in the US rarely take note of things going on down in Australia, unless it involves bikinis, surfing, or great white shark attacks. It’s sad but true. Although, if you were to ask any of us, we’d probably list it as one of our bucket-list dream vacations. Some Americans would be shocked to find out that there are indeed over 23 million Australians who are not in movies, nor carrying the last name “Hemsworth” or “Jackman.” Do you know how we would know that? From the census. Yeah. So here we are. Did you know that the latest effort at tallying up those mysterious Aussies was also fraught with peril?
You see, it used to be that a census was taken by hand, or by mail. Half of the people who could respond didn’t and had to be tracked down. Finding out WHO your citizens are is a daunting and very expensive process. When I say expensive, the 5-year budget for Australia’s census ran roughly $470 million. I would tend to think… “Just make a wild guess based on the last go-around and give everyone their $20 cut, eh?” Instead, the Australian Bureau of Statistics (ABS from now on, so I don’t have to spell Bureau again…) hired a service provider to put it all online and save taxpayers over $100 million.
On the night of the census, an apparent problem in the networking equipment triggered false alarms of attacks, which shut down the website. Boom. 23 million people (or more) wanting to do their civic duty got a nice “try back later” page. For us in the US, this might hit close to home: our Affordable Care Act (ObamaCare) website (cost: $2.1 billion) didn’t function like it was supposed to on day one either. The ABS website ended up down for over 40 hours because a bit of code apparently didn’t load correctly during a reboot of a piece of hardware. 40 hours. One router. Sound right to you?
The reason this is important to me is that in the follow-on press conferences with government officials, the service provider was squarely blamed for the problem. Someone had to pay the $30 million in damages (on top of the $470 million, but after the $100 million savings? Or was that $470 million after the $100 million discount? We may never know). The service provider, of course, blamed one of their providers, who, of course, said that they had offered, but the customer did not take, the service that would have prevented this. The ABS stated that they made some mistakes and probably should have extended the consulting time a little longer. How much would THAT have cost?
Here’s the reality: outsourcing has a dark side. We pay a reputable company roughly $20 a head so that every citizen in the country can log in to a website and answer a few questions. It’s a simple service-level agreement: the website will be available, and it will “handle the load” to a reasonable extent. But what is our recourse when things go wrong? What happens when the SLA is not met? In the case of our wonderfully mysterious Australians, they foot the bill for the whole thing, plus remediation, through their taxes. Nobody at the service provider was fired for the debacle (according to sources). And life goes on. The rather lame answer to how the event could have been avoided was, “Maybe we could have turned it (the router) off and back on again.”
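To put a 40-hour outage in SLA terms, here is a quick back-of-the-envelope sketch. The availability tiers below are illustrative examples, not the actual terms of the ABS contract (which I haven’t seen):

```python
# Illustrative SLA math: how much downtime per year each availability
# tier permits, and whether a 40-hour outage blows through it.
# These tiers are hypothetical, not the real ABS contract terms.
HOURS_PER_YEAR = 365 * 24  # 8,760

def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year permitted by a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

outage_hours = 40  # the census site's reported downtime

for tier in (99.0, 99.9, 99.99):
    allowed = allowed_downtime_hours(tier)
    verdict = "VIOLATED" if outage_hours > allowed else "within budget"
    print(f"{tier}% uptime allows {allowed:.2f} h/yr -> 40 h outage: {verdict}")
```

Even a modest “three nines” (99.9%) commitment allows under nine hours of downtime a year; 40 hours in a single incident blows that apart several times over.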
Maybe, just maybe, if the system had been designed with resiliency in mind, it wouldn’t have happened. There is NO excuse in this day and age that a full backup system couldn’t have been running in another datacenter, behind a different router. There is no reason why the website couldn’t have failed over within minutes to Amazon Web Services, load balanced with Microsoft Azure! Let me correct myself; there is a reason: perhaps the outsourced company didn’t choose technology that could do this because it wasn’t part of their portfolio. Perhaps the ABS budget came up a little short for redundancy.
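The kind of failover I’m describing isn’t exotic. Here is a minimal sketch of an active health check with automatic failover to a standby site; the endpoints, threshold, and function names are all invented for illustration, not from the actual census system:

```python
import urllib.request

# Hypothetical endpoints: the primary site and a standby in a second
# datacenter behind a different router (names are made up).
PRIMARY = "https://census.example.gov/health"
STANDBY = "https://census-dr.example.net/health"

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # DNS failure, refused connection, timeout, HTTP error
        return False

def choose_target(consecutive_failures: int, threshold: int = 3) -> str:
    """Fail over to the standby after `threshold` consecutive primary failures."""
    return STANDBY if consecutive_failures >= threshold else PRIMARY

# A monitoring loop would call is_healthy(PRIMARY) every few seconds,
# count consecutive failures, and repoint DNS or the load balancer at
# choose_target(failures) -- minutes of downtime instead of 40 hours.
```

Every major cloud load balancer offers some variant of this out of the box; the point is that it has to be designed in, budgeted for, and tested before census night.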
There have been some terms bandied about in the technical world for quite some time that incite me to violence. The first is “one throat to choke.” When an IT director/CIO uses it, one must add “other than mine” to that phrase to get closer to the meaning. Lack of accountability is the first benefit, and detriment, of outsourcing. A $30 million hand slap to the vendor to make a situation go away and pave the way for the next big contract is the stark reality of these kinds of failures. Who paid? The Australian people (including the Hemsworths and that Wolverine guy). In this case, things went bad, and no throats were choked. A lot of fingers were pointed, and eventually they came back full circle in a wave. It’s funny in a clown show, but not for $470 million.
I’m not on the ABS remediation team, and I wasn’t involved with the project in any way. But I am (and my customers are too!) constantly impacted by huge companies winning contracts to do mediocre work with half-baked technology and somehow staying in business. Customers push for the lowest-cost bid and wonder why they get the lowest quality in return. Thoughtful architecture in developing resilient IT systems is an absolute requirement, whether you are building a $470 million website project or a $40,000 small-business server farm. The Australian people were impacted (mostly financially) by a single point of failure in a system. There were $30 million in reparations for a $30K piece of equipment that was mismanaged and under-tested. There were apparently no real tests for false positives, no analytics, no intervention, and a serious lack of design intelligence, which SHOULD be the take-home lesson of this incident.
My product, FreeStor, would not have stopped the router from shutting down, but it could have helped instantly shift workloads to alternate datacenters behind active routers. It could have spawned instantly accessible test platforms while production was running, to look for impacts to data systems. I’m not saying any of those things could have sidestepped the problem in and of themselves, because the real problem was a failure to plan for disaster rather than for the best outcome. Let me say that again, with more clarity: the real problem was not technology, but a failure to plan and budget for worst-case outcomes. Every system today needs to be designed from a multiple-vendor, multiple-location perspective. Point products (products that can only do one thing) cannot be allowed to operate in oversight vacuums; they must integrate with, or be monitored as part of, an intelligent system.
Demand more from your infrastructure and stop operating to the limitations of your service providers and their technology. Demand accountability from vendors, providers, and your tools.