How Much Availability is Enough

At Oracle OpenWorld 2017, Larry Ellison announced “Oracle Autonomous Database Cloud” and “Oracle Autonomous Data Warehouse Cloud”. Among other features, Oracle guarantees that these databases will meet an SLA of 99.995% availability, or less than 30 minutes of “costly planned and unplanned downtime” a year.
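To see where the "less than 30 minutes" figure comes from, it helps to convert an availability percentage into a yearly downtime budget. The following is a minimal sketch, assuming a 365-day year; the function name is illustrative and not part of any Oracle tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Compare a few common availability targets with the 99.995% figure.
for pct in (99.0, 99.9, 99.99, 99.995):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.1f} minutes/year")
```

At 99.995%, the budget works out to roughly 26 minutes a year, which is consistent with the "less than 30 minutes" claim.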

For applications that require that level of availability, this sets a very high bar. However, this sort of availability comes at a cost, and not all applications require it. In this post, I’ll offer some suggestions on how to determine the correct levels of availability for your applications, and suggest some ways you can obtain those levels at significantly less cost and complexity.

It’s Not Just the Database

When we consider availability for database systems, we need to remember that while the database may be an integral component of an application that delivers value to end users, it’s usually not the only component. Even if we configure the hardware underlying the database, or the entire “technology stack”, for high availability, there will be components of the application outside of the data center, and some may not even be IT related. For example, if the building occupied by users of an application is destroyed by a fire, and the application is not accessible via an external network, the application is unavailable.
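One way to make this concrete: when every component in a chain must be up for the application to work, the end-to-end availability is the product of the individual availabilities, assuming the components fail independently. The sketch below uses hypothetical component names and figures purely for illustration.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes component failures are independent of one another.
    """
    return prod(component_availabilities)


# Hypothetical figures, expressed as fractions of time available.
stack = {
    "database": 0.99995,
    "app_server": 0.999,
    "network": 0.999,
    "client_site": 0.995,
}

overall = serial_availability(stack.values())
print(f"End-to-end availability: {overall:.4%}")
```

Under these assumptions the end-to-end figure is always lower than that of the weakest component, so a 99.995% database does not by itself produce a 99.995% application.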

Set a Reasonable SLA

The first step in setting a realistic SLA, or Service Level Agreement, which defines when an application must be available, is to consider the business use of the application. For example, an accounting system used by people who work 8-5, Monday through Friday, in a single time zone will have a very different SLA than an online order entry system for a company that trades in all time zones.

Next, you’ll want to consider the overall cost of downtime. Again, an accounting system with manual backup procedures, or one where processes can be postponed, probably has a lower cost of downtime than an online shopping system, where an inaccessible system will probably result in lost orders.
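A rough comparison of downtime cost can be sketched as expected downtime hours multiplied by an hourly cost. All figures below are hypothetical placeholders; substitute your own estimates of service hours, availability, and cost per hour.

```python
def annual_downtime_cost(availability_pct, cost_per_hour, service_hours_per_year=8760):
    """Estimate the yearly cost of downtime for a given availability level.

    service_hours_per_year is the time the application is expected to be
    in service; 8760 corresponds to 24x7 operation in a 365-day year.
    """
    downtime_hours = service_hours_per_year * (1 - availability_pct / 100)
    return downtime_hours * cost_per_hour


# Hypothetical 8-5 weekday accounting system: 9 hours x 5 days x 52 weeks.
accounting = annual_downtime_cost(99.0, cost_per_hour=500, service_hours_per_year=9 * 5 * 52)

# Hypothetical 24x7 online order entry system.
order_entry = annual_downtime_cost(99.9, cost_per_hour=20_000)

print(f"Accounting:  ${accounting:,.0f} per year")
print(f"Order entry: ${order_entry:,.0f} per year")
```

Even with a better availability percentage, the order entry system in this sketch carries a far higher yearly downtime cost, which is the kind of difference that should drive how much you spend on availability.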

You may want to consider risks other than cost as well. For example, the risks associated with an accounting system being unavailable are considerably lower than the risks associated with a real-time Electronic Medical Record system being unavailable. When evaluating risks, you may want to use one of the many available risk models or frameworks, such as “The Risk IT Framework” from ISACA. For very high-value or high-risk systems, risk analysis may involve fairly complex probability calculations, and even game theory.

Also consider what alternatives are available to replace, even temporarily, the use of the application. For example, orders can be taken using handwritten or typed forms, checks can be handwritten, or data entry can be postponed. For systems that capture data from other automated systems, such as telephone call detail records, perhaps those transactions can be stored in a portion of the system “upstream” of the database. As you consider the alternatives, don’t forget to include their costs, including opportunity costs and the costs of maintaining the alternatives.

Consider the lifecycle of the data stored in the database. One thing to consider is the volatility of the data, or how often records change. Availability strategies will be very different for data that changes frequently, or that needs to be updatable for its entire lifecycle, than for data such as records of payment, which are basically written once. It’s also possible that the data stored in the database is also stored in other locations, either upstream or downstream of the database you’re considering.

Two critical components of an SLA are the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). The Recovery Time Objective sets the “allowed” downtime before an application absolutely must be recovered and available after an outage. The Recovery Point Objective sets expectations for the allowed data loss. RTOs are almost always expressed in units of time, such as seconds, minutes or hours. RPOs can be expressed either in units of time or in units of work, such as transactions. Note that there may be multiple RTOs and RPOs, depending on the severity of the outage. For example, the RTO/RPO for an outage of a single server in a data center might be tighter than the RTO/RPO for an area-wide disaster, such as an earthquake or flood. As you develop your strategy, you will use the RTO and RPO specifications to validate the effectiveness of proposed strategies.
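Validating a proposed strategy against RTO and RPO targets can be sketched as a simple comparison. This is an illustrative model only: the class names, scenarios, and timings are hypothetical, and the RPO is expressed as time here (an RPO expressed in transactions would need a different unit).

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objective:
    scenario: str    # the outage this objective applies to
    rto: timedelta   # maximum tolerated downtime
    rpo: timedelta   # maximum tolerated window of lost data


@dataclass(frozen=True)
class Strategy:
    name: str
    recovery_time: timedelta     # estimated time to restore service
    data_loss_window: timedelta  # estimated worst-case window of lost data


def meets(strategy: Strategy, objective: Objective) -> bool:
    """Return True if the strategy satisfies both the RTO and the RPO."""
    return (strategy.recovery_time <= objective.rto
            and strategy.data_loss_window <= objective.rpo)


# Hypothetical objective and candidate strategies.
server_failure = Objective("single server failure",
                           rto=timedelta(minutes=15), rpo=timedelta(minutes=1))
nightly_backup = Strategy("nightly backup",
                          recovery_time=timedelta(hours=4),
                          data_loss_window=timedelta(hours=24))
replicated_standby = Strategy("replicated standby",
                              recovery_time=timedelta(minutes=5),
                              data_loss_window=timedelta(seconds=30))

for candidate in (nightly_backup, replicated_standby):
    verdict = "meets" if meets(candidate, server_failure) else "misses"
    print(f"{candidate.name} {verdict} the objective for {server_failure.scenario}")
```

Because there may be several RTO/RPO pairs, one per outage severity, the same check would be repeated for each scenario before a strategy is accepted.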

Develop a Strategy

Once you understand your application’s SLA requirements and the nature of the data, you can develop a strategy to ensure that the SLA availability requirements can be met.

For a very simple case, such as an 8-5 accounting system with large windows of inactivity, a simple nightly backup with tested restore processes may be sufficient. At the other end of the spectrum, for high-value or high-risk applications that require near-zero RTOs and RPOs, even a single instance in the Oracle Autonomous Cloud might not provide sufficient availability.

Availability for applications “in the middle”, between these extremes, can be addressed with a number of solutions, some more complex or costly than others.

For applications with RTOs measured in minutes and RPOs in small numbers of transactions, and that have some “downtime windows” available, a simple RAC environment may suffice for local high availability. You’ll then need to consider recovery in the event your primary data center becomes unavailable. Oracle’s Cloud solutions may look attractive for these cases.

SharePlex – The Gold Standard

For those applications in the middle, where you need both local high availability and disaster recovery, there’s another alternative to putting everything in “the cloud”. SharePlex, Quest’s award-winning database replication product, can be used to ensure high availability and rapid recovery for both local and remote databases, either on premises or in “the cloud”. It can be used to facilitate migrations where the only downtime required is the time needed to move application connections from one database to another. It can be used to provide near-real-time access to data across multiple databases, thereby eliminating costly and time-consuming ETL processes. And the great news is that all of this can be done at a cost that’s usually significantly less than equivalent Oracle solutions.