Customers often ask, “Can SharePlex be used to help me achieve High Availability?” or “Can a SharePlex database target be used for Disaster Recovery?”
The short answer to both of these questions is “Yes!” This series of blogs will look in depth at how SharePlex can be used to help facilitate HA and DR.
Wikipedia defines “High Availability” as “a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period”. Let’s dissect this a bit.
First, we’re talking about a system, not just a database, or a web server, or a disk. If your company takes orders from customers, your users will not consider your system “available” if the database is up but the web server that displays the order pages is not. We’re also talking about “…uptime, for a higher than normal period”, so we need to define “normal”.
“Normal” is pretty variable, depending on your application and use case. For example, if you’re running a system that supports accounting, and all of the accountants work only 8-5 Monday through Friday, “normal” is probably 8-5 Monday through Friday. On the other hand, if you’re running a system that supports First Responder Dispatch, “normal” is most likely 7 days a week, 24 hours a day.
The definition also talks about “operational performance”. This implies that you’ll need to define what acceptable performance means for your organization. Does your organization have documented Service Level Agreements? As we’ll see later, these are critical to measuring High Availability or Disaster Recovery.
Obviously, your definitions of “normal” and “operational performance” will affect how you design and engineer your systems, and how much High Availability will cost.
Wikipedia defines “Disaster Recovery” as “a set of policies, tools and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster”. It goes on to say that “Disaster recovery focuses on the IT or technology systems supporting critical business functions.”
To dissect this, here’s what Wikipedia says about a “Disaster”. “A disaster is a serious disruption, occurring over a relatively short time, of the functioning of a community or a society involving widespread human, material, economic or environmental loss and impacts, which exceeds the ability of the affected community or society to cope using its own resources”.
A few things to note here. A disaster has to be “serious”. A fire in a Data Center could be “serious”, but what if it’s just popcorn in the microwave, as opposed to a power supply failure that fills the whole room with smoke and trips the sprinkler system? Or what about a power outage, if you have a backup generator that starts, and fuel for a week?
Next, and perhaps most important, a disaster has to be “widespread” and “exceed(s) the ability of the affected community…to cope using its own resources”. Again, losing a single server is probably not a “disaster”, although with a lack of planning it might turn into one. But (to paraphrase Scott Adams), if a meteor destroys your data center, that’s almost certainly a disaster.
HA vs DR
Now that we’ve defined these, let’s look at some differences and similarities.
Both HA and DR can be considered subsets of “Business Continuity”, that is, how we ensure that our business operations can continue in the event that “something bad” happens.
A core component of successful HA and DR programs is redundancy, or elimination of single points of failure. For the database component of our systems, both HA and DR usually involve making copies of the database, but for different reasons (see below for differences).
Another key element of HA and DR is risk assessment, which leads to costing and cost comparisons. The risk of an earthquake is quite high in some areas of the country, and almost non-existent in others. The cost of recovering from a single server failure is significantly less than the cost of rebuilding a data center after a fire. Cost vs Risk evaluations will allow you to build appropriate HA and DR programs without “breaking the bank”.
Both HA and DR systems should have agreed-upon objectives and measures, such as availability for HA systems, and Recovery Point and Recovery Time Objectives for DR. We’ll define some of these measures in the next section.
Just looking at the definitions, HA is all about systems and how they’re designed, but DR is all about policies, tools and procedures. When building systems for HA, we try to prevent failure of the overall system, by eliminating single points of failure and automating failover or recovery procedures.
However, when building DR systems, the assumption is that the primary system has failed and that recovery of that system will take some time.
This ties back to our objectives and measures. For HA systems, these are typically defined as “availability”, usually expressed as a percentage of the “expected” system availability time. For example, if we have a system that’s supposed to be available from 8AM to 5PM, 5 days per week, that’s 9 hours per day or 45 hours per week. An availability of 99.99% (four nines) for such a system would allow 0.0045 hours, or about 16 seconds, of “downtime” per week. For a 7 x 24 system (168 hours per week), we would be allowed a little over 60 seconds per week.
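The downtime arithmetic is easy to check with a few lines of Python. This is only a sketch: the function name allowed_downtime_seconds and its parameters are made up for illustration, and the inputs are the same scheduled hours and “four nines” target used in the example above.

```python
def allowed_downtime_seconds(scheduled_hours_per_week: float,
                             availability_pct: float) -> float:
    """Return the weekly downtime budget, in seconds.

    scheduled_hours_per_week: hours the system is expected to be up
    availability_pct: availability target, e.g. 99.99 for "four nines"
    """
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return scheduled_hours_per_week * 3600.0 * unavailable_fraction


# 8AM-5PM, 5 days per week = 45 scheduled hours
business_hours_budget = allowed_downtime_seconds(45, 99.99)

# 7 x 24 = 168 scheduled hours
around_the_clock_budget = allowed_downtime_seconds(168, 99.99)

print(f"8-5 system:    {business_hours_budget:.1f} seconds per week")    # about 16.2
print(f"7 x 24 system: {around_the_clock_budget:.1f} seconds per week")  # about 60.5
```

Note that the budget grows with the scheduled hours: the same percentage allows more seconds of downtime on a 7 x 24 system, but there are also far more hours in which something can go wrong.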
On the other hand, DR systems typically have measures of “Recovery Times” and “Recovery Points”. For example, we might want to be able to recover our order entry systems from a data center fire in an hour (Recovery Time Objective) and lose only 5 minutes’ worth of transactions (Recovery Point Objective).
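To make the two DR measures concrete, here is a minimal sketch of how the results of a DR drill might be compared against those objectives. The class and function names are hypothetical, and the drill figures are invented purely for illustration; only the one-hour and five-minute objectives come from the example above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # Recovery Time Objective: longest acceptable time to restore service
    rpo: timedelta  # Recovery Point Objective: largest acceptable window of lost transactions


def meets_objectives(objectives: RecoveryObjectives,
                     time_to_restore: timedelta,
                     data_loss_window: timedelta) -> bool:
    """Return True only if both the RTO and the RPO were met."""
    return (time_to_restore <= objectives.rto
            and data_loss_window <= objectives.rpo)


# Objectives for the order entry system: recover in an hour, lose at most 5 minutes
order_entry = RecoveryObjectives(rto=timedelta(hours=1), rpo=timedelta(minutes=5))

# Hypothetical drill results (invented numbers, not measurements)
drill_a = meets_objectives(order_entry, timedelta(minutes=48), timedelta(minutes=3))
drill_b = meets_objectives(order_entry, timedelta(minutes=48), timedelta(minutes=12))

print(drill_a)  # True: restored within the hour, lost under 5 minutes of transactions
print(drill_b)  # False: restored in time, but 12 minutes of transactions were lost
```

The second drill shows why the two measures are tracked separately: a recovery can be fast enough and still lose more data than the business agreed to tolerate.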
Once we’ve defined our terms, and set out objectives and measurements, we can begin to design systems that meet those objectives.
For HA systems, we’ll want to design “failover”, or switching between systems, that meets our availability objectives. If we’re trying to achieve “four nines”, even in an 8-5 system, this almost certainly means eliminating all single points of failure and automating failover.
For DR, we need to make sure our systems can survive a disaster, which usually means building a second system in a location removed from the primary, so that local events such as weather, earthquakes, or meteors won’t damage both systems. A DR “failover” will be different from an HA failover, in part due to the distance between the two systems.
In my next blog, we’ll look specifically at HA, and see how SharePlex can be used to eliminate single points of failure and facilitate rapid failover.