Thursday, October 08, 2009

Disaster dead?

I work in the computer industry and provide QA and support for a number of web portals. A few weeks ago, as part of an internal security review, one of our clients requested a copy of our Disaster Recovery Plan for one of those web applications. I wasn't terribly surprised to discover that we didn't have a formal plan drafted. About a week before that, they had asked for information about our uptime policies, server redundancy, hosting facilities and backup plan.

With regard to disaster recovery, I reiterated our backup systems and redundancy plan and let them know that we would handle a disaster the same way we would handle any error on the system: we would use our code repository, code backups, database backups and redundant virtual servers to get the site back up and running in as short a time as possible, with minimal data loss. Naturally, this quick blurb didn't suffice. They wanted a formally drafted plan, which eventually included a very detailed schema of the low-level operations we have in place.

Today, this article came across my email claiming that "disaster recovery is dead." As it states in its first paragraph, that's a rather interesting claim, especially when you consider that it comes from the director at the International Disaster Recovery Institute.

This article explains what I was trying to get across to our client's security team, but it does so much more eloquently.

A dozen or more years ago, the Internet was still a bit shaky and it was understandable for sites to have outages, be laggy or otherwise have bouts of unreliableness (is that a word?). These blips of instability were not considered disasters. The disasters of the early 90s were more in line with the "natural" disasters one thinks about: floods, earthquakes, terrorism, war, alien invasion, etc.

Over the past dozen+ years, however, the world has shifted significantly such that even a flicker of a site outage is considered a disaster. As such, we are always in a sort of "disaster recovery" mode.

Rather than continuing to call it "disaster recovery", the article suggests we simply fold it into the policy of "continuous business operation", and I would tend to agree. In today's world, disaster recovery is an integral part of the business policy for "100% reliability", provided via constant monitoring, redundant servers and frequent backups.

While all of this makes sense, it also brings to the forefront one of the things that most bothers me about this industry. There is no downtime. Indeed, there cannot be any downtime. That makes for a rather stressful work environment because, even with a very robust action plan to provide continuous uptime, there are always times when something will go wrong that wasn't accounted for. And even a very fluid plan can be stressful to put into action. Depending on the number of people affected, even a 5-minute outage can be disastrous and have stakeholders and upper management breathing down your neck to get things fixed.
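
To put that 5-minute figure in perspective, here is a rough back-of-the-envelope sketch in Python. It is my own illustration, not something from the article: the function names and the list of availability targets are assumptions, and it ignores leap years. It works out how many minutes of downtime a year a given uptime target actually allows.

```python
# Back-of-the-envelope downtime arithmetic (illustrative only).
# Assumption: a 365-day year, so 525,600 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime permitted in the period at the given availability target."""
    return period_minutes * (1 - availability_pct / 100)


def availability_after_outage(outage_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability (in percent) for the period if the only downtime was this outage."""
    return 100 * (1 - outage_minutes / period_minutes)


# Hypothetical targets, chosen only to show the scale of the numbers.
for target in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per year")

print(f"A single 5-minute outage leaves {availability_after_outage(5):.5f}% for the year")
```

A "five nines" target (99.999%) works out to roughly 5.26 minutes a year, so a single 5-minute outage uses up nearly the entire annual allowance, and a literal 100% target leaves no allowance at all.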

It's at times like these (like the 'fires' of this past week *grimace*) that I long for a slower-paced world where it's not the end of the world if a system goes down or has an error. It's bad, yes, but it's understandable, and everyone knows that it will be back when it's back and that everyone is doing all they can to get it back up.

