Customers often ask, “What’s the difference between high availability and disaster recovery?” and “Can I use the same solution to achieve both?”
Wikipedia defines high availability as “a characteristic of a system which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period”. Let’s dissect this a bit.
First, we’re talking about a system, not just a database, or a web server, or a disk. If your organization processes orders from customers, your users will not consider your system “available” if the database is up but the web server that displays the order pages is not. We’re also talking about “uptime, for a higher than normal period.” We need to define “normal.”
“Normal” can vary depending on your application and use case. For example, if you’re running a system that supports accounting, and all the accountants work from 8:00 am to 5:00 pm Monday through Friday, then your normal is probably 8:00 am to 5:00 pm Monday through Friday. On the other hand, if you’re running a system that supports first responder dispatch, your normal is most likely 24 hours a day, 7 days a week.
The definition also talks about “operational performance.” This implies that you’ll need to define that for your organization as well. Does your organization have documented service level agreements (SLAs)? As we’ll see later, these are critical to measuring high availability and disaster recovery.
How you define “normal” and “operational performance” will affect how you design and engineer your systems, and how much high availability will cost.
Wikipedia defines disaster recovery as “a set of policies, tools and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.” It goes on to say that “Disaster recovery focuses on the IT or technology systems supporting critical business functions.”
To dissect this, here’s what Wikipedia says about a disaster. “A disaster is a serious disruption, occurring over a relatively short time, of the functioning of a community or a society involving widespread human, material, economic or environmental loss and impacts, which exceeds the ability of the affected community or society to cope using its own resources.”
A few things to note here. A disaster has to be serious. A fire in a data center could be serious, but what if it’s just popcorn in the microwave, as opposed to a power supply failure that fills the whole room with smoke and trips the sprinkler system? Or what about a power outage, if you have a backup generator that starts and enough fuel for a week?
Next, and perhaps most important, a disaster has to be widespread and “exceed(s) the ability of the affected community….to cope using its own resources.” Again, losing a single server is probably not a disaster, although with lack of planning, it might turn into one.
High Availability versus Disaster Recovery
Now that we’ve defined high availability and disaster recovery, let’s look at some differences and similarities.
Both HA and DR can be considered subsets of business continuity, or how we ensure that our business operations can continue in the event something bad happens.
A core component of successful HA and DR programs is redundancy, or elimination of single points of failure. For the database component of our systems, both HA and DR usually involve making copies of the database, but for different reasons (see below for differences).
Another key element of HA and DR is risk assessment, which leads to costing and cost comparisons. The risk of an earthquake is quite high in some areas of the country, and almost non-existent in others. The cost of recovering from a single server failure is significantly less than the cost of rebuilding a data center after a fire. Cost versus risk evaluations allow you to build appropriate HA and DR programs without breaking the bank.
Both HA and DR systems should have agreed-upon objectives and measures, such as availability for HA systems, and recovery point and recovery time objectives for DR. We’ll define some of these measures in the next section.
Just looking at the definitions, HA is all about systems and how they’re designed, but DR is all about policies, tools and procedures. When building systems for HA, we try to prevent failure of the overall system by eliminating single points of failure and automating failover or recovery procedures.
However, when building disaster recovery systems, the assumption is that the primary system has failed and that recovery of that system will take some time.
This ties back to our objectives and measures. For HA systems, these are typically defined by availability, which is usually expressed as a percentage of the expected system availability time. For example, if we have a system that’s supposed to be available from 8:00 am to 5:00 pm, 5 days per week, that’s 9 hours per day or 45 hours per week. An availability of 99.99% (four nines) for such a system would allow about 16 seconds of downtime per week. For a 24x7 system, we would be allowed a little over 60 seconds per week.
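The downtime arithmetic is easy to check for any schedule and availability target. Here is a minimal Python sketch; the function name and its parameters are illustrative only and do not come from any particular tool.

```python
def allowed_downtime_seconds(hours_per_week: float, availability_pct: float) -> float:
    """Return the downtime budget, in seconds per week, for a given
    scheduled service window and availability target (as a percentage)."""
    scheduled_seconds = hours_per_week * 3600
    unavailable_fraction = (100.0 - availability_pct) / 100.0
    return scheduled_seconds * unavailable_fraction


# 8:00 am to 5:00 pm, 5 days per week = 45 scheduled hours
business_hours = allowed_downtime_seconds(45, 99.99)

# 24 hours a day, 7 days a week = 168 scheduled hours
around_the_clock = allowed_downtime_seconds(168, 99.99)

print(f"8-5 system at four nines:  {business_hours:.2f} seconds per week")
print(f"24x7 system at four nines: {around_the_clock:.2f} seconds per week")
```

At four nines this works out to roughly 16.2 seconds per week for the 45-hour schedule and roughly 60.5 seconds per week for the 168-hour schedule. Changing the second argument to 99.9 or 99.999 shows how quickly the budget grows or shrinks with each additional nine.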
On the other hand, DR systems typically have measures of recovery times and recovery points. For example, we might want to be able to recover our order entry systems from a data center fire in an hour and lose only 5 minutes’ worth of transactions.
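To make the two DR measures concrete, here is a small Python sketch that checks the outcome of a recovery (or a recovery drill) against the objectives in the example above: one hour to recover, and at most 5 minutes of lost transactions. The class, the function and the timestamps are all hypothetical and are used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    recovery_time: timedelta   # longest acceptable outage before service is restored
    recovery_point: timedelta  # largest acceptable window of lost transactions


def meets_objectives(objectives: RecoveryObjectives,
                     last_copied: datetime,
                     failed_at: datetime,
                     restored_at: datetime) -> bool:
    """Compare what actually happened against the agreed objectives."""
    data_lost = failed_at - last_copied   # transactions made after the last good copy
    outage = restored_at - failed_at      # how long users were without the system
    return (data_lost <= objectives.recovery_point
            and outage <= objectives.recovery_time)


# Objectives from the order entry example: recover in 1 hour, lose at most 5 minutes.
order_entry = RecoveryObjectives(recovery_time=timedelta(hours=1),
                                 recovery_point=timedelta(minutes=5))

failure = datetime(2024, 1, 1, 12, 0)  # made-up failure time
drill_passed = meets_objectives(order_entry,
                                last_copied=failure - timedelta(minutes=3),
                                failed_at=failure,
                                restored_at=failure + timedelta(minutes=45))
print("Objectives met:", drill_passed)
```

Note that the two measures are independent: a recovery can come back quickly but lose too much data, or preserve every transaction but take too long.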
Once we’ve defined our terms and set our objectives and measurements, we can begin to design systems that meet those objectives.
For HA systems, we’ll want to design failover or switching between systems that meets our availability objectives. If we’re trying to achieve “four nines,” even in an 8-5 system, this almost certainly means eliminating all single points of failure and automating failover.
For DR, we need to make sure our systems can survive a disaster, which usually means building a second system in a location removed from the primary so that local events such as weather, earthquakes or meteors won’t damage both systems. A disaster recovery failover will be different from a high availability failover, in part due to distances between the two systems.
Be sure to read the next two posts in this series to learn more about how SharePlex can help with both high availability and disaster recovery.