THE AUSTRALIAN TAX OFFICE HAS A BAD DAY
By Pete McCallum
Have you ever had a truly horrible day? The thing about bad days is that it’s usually a single exceptionally horrible moment, buffered by hours of regret, second-guessing, and replaying the event. I was thinking about the moment my life in IT changed: when I went from being a helpdesk jockey to, eventually, the lead architect and thought leader for system design at my company. The moment occurred when the backup guy, who sat next to me, mumbled “uh oh” under his breath and ran somewhere. All of a sudden, my user drive and team drive went offline. Dust stopped moving in the air, and everything halted for a few seconds while my future realigned itself to a new reality.
We had a SAN failure. Multiple terabytes of government data for defense programs simply vanished in an equipment failure. At roughly $1 million an hour in cost to the business, this outage falls into the category of truly horrible moments. Well, for the other guy, right? The cause was that both RAID controllers in the SAN failed within 15 seconds of each other. It’s almost incomprehensible, but there it is. What truly made the moment horrible is that the backup guy had been trying to remediate a problem he already knew about. It was a small problem: backups weren’t finishing, and hadn’t been for a long time. Data was lost, and it took over two weeks to recover what we could from multiple sources. It was a serious black eye on the capabilities of our IT department. People lost jobs. On the other hand, other people gained new responsibilities, and budget was finally allocated for resiliency and redundancy. It’s funny how that works: loss creates budget.
Shifting gears just slightly: Have you ever really hoped that your local tax office would simply lose your records? What if your mortgage records just went away and you never had to pay again? It’s a little fantasy that (I’d like to believe) many people entertain the closer they get to tax day. Interestingly, this may have just happened to more than 23 million Australians! They have everything down there: cool accents, great beaches, amazing animals, and rugby. And now they have an offline tax office. I’m packing my bags!
Let me give you some of the details before you decide I’m overly gleeful to see this. It is a serious issue, and a very big deal. Taxes pay for things Australians need. You see, the Australian Tax Office ran for many years on a heterogeneous infrastructure platform (that means: many vendors and products working together) until it tossed the old and went with a single-vendor strategy on what was, at the time, a “true active/active symmetric highly available environment.” Remember the Titanic? It was unsinkable, too.
In mid-December, the Australian Tax Office (ATO) disclosed that its new arrays had an “issue” that put over 1PB of public data at risk. Bad day. Truly horrible moment. As it turns out, one storage system experienced data corruption and replicated it to the other. Have you ever seen what happens in a box of strawberries when one goes bad? Everything around it goes bad pretty quickly. Ten days later, the ATO was still trying to restore data from backup tapes. For weeks following the incident, there were maintenance outages as systems were brought back online and repaired. Just Google “ATO website outage” and you can read all about it.
I want to be clear that I’m specifically bashing the consultant who advised a single-vendor strategy and array-based data services. I have an inside scoop on this, because I know a guy with a carefully typed-out “I Told You So” letter that he may or may not ever send. Remember, the tax office moved away from a multi-vendor system in which different technologies were bridged by an “abstraction layer” of software that provided the data services. In that paradigm, individual hardware failures are far less impactful. Think on this: if you have two of the same item, a flaw in one is a flaw in the other, and whatever triggers the flaw in one is highly likely to trigger it in the other. Voilà! A box of strawberries. This flaw-sharing issue is compounded by the fact that these arrays (like almost every SAN on the market) are designed to talk only to each other. A 3PAR SAN cannot replicate to an EMC SAN, deduplication and snapshots all live within the same box, and any reporting or early-warning systems apply only to that one vendor’s gear. It’s a systemic condition designed to retain customers and force single-vendor affinity!
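The flaw-sharing argument can be put in rough numbers. Here is a minimal sketch, using purely illustrative probabilities I have invented for the example; nothing below comes from the ATO incident or any vendor’s published reliability figures:

```python
# Hypothetical numbers, purely for illustration.
p_fail = 0.01          # chance one array hits a fatal fault in some period
shared_flaw = 0.9      # chance the same trigger fires the identical flaw in the twin

# Independent, dissimilar systems: both must fail on their own.
p_both_independent = p_fail * p_fail        # 0.0001

# Identical paired systems sharing a design/firmware flaw: once one fails,
# the same trigger is very likely to take out the other (the strawberries).
p_both_shared = p_fail * shared_flaw        # 0.009

print(f"independent pair: {p_both_independent:.4%}")
print(f"identical pair:   {p_both_shared:.4%}")
print(f"ratio: {p_both_shared / p_both_independent:.0f}x more likely")
```

With these made-up inputs, the identical pair loses both copies 90 times more often than the dissimilar pair. The exact numbers don’t matter; the point is that correlated failure destroys the multiplication of small probabilities that redundancy is supposed to buy you.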
With the problem well defined, I do want to talk about a solution. 1PB of data is nothing to scoff at, and a system crash is a truly horrible day (even when it happens to your competition). The ATO made a business decision based on what it thought was great technology at a great budget. The problem is that business tends not to think large-scale about IT. We don’t buy from a perspective of interconnected systems. We talk about performance and capacity and savings and efficiency, but we don’t look at the overall impact of failure scenarios and remediation. And why would a salesperson offer that up for discussion? I have spoken with hundreds of prospects and clients who ran perfect business continuity/disaster recovery on paper, on paired systems, only to find disasters replicated and backup systems broken as they transformed their businesses. Cloud marketing has claimed that IT has failed the business through a lack of agility and ability to transform, but it is the very single-vendor, xenophobic system mentality, championed by analysts and embraced by business for years, that has “broken” IT.
I often end my articles and blogs with the admonishment to “demand more from your IT, service providers, and technology.” I’m just going to come out and say it: stop making ridiculous decisions based on half-thought-out architecture. The Australian Tax Office not only suffered a service outage; the outage impacted hundreds of thousands of businesses and millions of taxpayers who rely on the service! Had the ATO stayed on a multi-vendor platform designed for interoperability, redundancy, and continuity, this outage would not have happened. Think on this: if my production system also holds my primary backups and manages my site-to-site replication, and it goes down… what do I do? The very resilience I thought I had is broken all at once, just like the Titanic! The backup systems broke along with production! You’d think we would have learned this lesson by now.
My product, FreeStor, is a storage abstraction platform very much like the one the ATO used before this incident. I adopted an earlier version of this platform back in the clean-up days following the disaster at my prior job. I was able to make sure that any given storage system had full redundancy on a different vendor’s hardware (EMC mirrored to HDS was my choice at the time). All of my snapshots were on separate storage, and I was able to cross-replicate all of my sites for full system resiliency and a 15-minute return-to-business SLA. I used a software VTL to speed up backups and restores, and offloaded to physical tape periodically for a tiered recovery approach (immediate-, short-, and long-term recovery) spread across different technologies and locations. During drills and real outages, I never missed my recovery SLA, and I proved to be quite a thorn in the side of the corporate boys, who had difficulty meeting a three-day recovery window on traditional systems.
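The tiered recovery approach described here can be sketched as a simple policy table: each tier lives on different technology in a different place, and you restore from the fastest tier that survived the failure. The tier names, sources, and recovery times below are my own illustrative assumptions, not the actual configuration from that environment:

```python
from dataclasses import dataclass

@dataclass
class RecoveryTier:
    name: str
    source: str       # where this copy of the data lives
    rto_minutes: int  # worst-case time to restore service from this tier

# Illustrative tiers modeled on the immediate/short/long-term split above.
tiers = [
    RecoveryTier("immediate", "cross-vendor mirror on separate arrays", 15),
    RecoveryTier("short-term", "snapshots on separate storage / software VTL", 240),
    RecoveryTier("long-term", "physical tape, off-site", 4320),
]

def fastest_tier_meeting(sla_minutes: int):
    """Return the fastest tier that can satisfy the SLA, or None."""
    for t in tiers:  # tiers are ordered fastest-first
        if t.rto_minutes <= sla_minutes:
            return t
    return None

# A 15-minute return-to-business SLA is only met by the mirror tier;
# lose the mirror along with production, and you fall back to tape days.
print(fastest_tier_meeting(15).name)  # immediate
```

The design point is the spread: because each tier sits on a different vendor and location, no single flaw can take out all three at once, which is exactly the property the ATO’s consolidated setup gave up.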
The ATO will recover, and Australian tax professionals and taxpayers will go back to business and life as normal. People may get fired, and (hopefully) the ATO will choose a different path to get the resilience it needs. It’s a tough lesson that tools can and will fail. The true fault lies in the hubris of believing something is unsinkable because it is packaged well and offered at a lower price.