Although IT disasters are unpredictable, disaster recovery shouldn't be. In fact, recovery should be planned, predictable and controlled. The following steps will help you organize your thoughts, ask the right questions, and develop the right strategy to build a DR plan that is closely aligned with your business.
A plan to recover from a disaster should always start with an inventory of all your IT assets. This is necessary to untangle the complexity of your environment. Start by listing all the assets under IT management, including all servers, storage devices, applications, data, network switches, access points, and network appliances. Then map where each asset is physically located and which network it is on, and identify any dependencies. Here is an example:
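One way to keep such an inventory machine-readable is sketched below. This is a minimal illustration, not a prescribed format: the `Asset` fields, the asset names, and the `recovery_order` helper are all hypothetical. It records each asset's location, network, and dependencies, then uses the standard library's `graphlib` module (Python 3.9 or later) to derive an order in which dependencies come back before the systems that rely on them.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Asset:
    """One inventory entry. All field names are illustrative assumptions."""
    name: str
    kind: str
    location: str
    network: str
    depends_on: list[str] = field(default_factory=list)


# Hypothetical inventory; replace with your own assets.
inventory = [
    Asset("core-switch-1", "network switch", "HQ server room", "mgmt"),
    Asset("san-1", "storage device", "HQ server room", "storage",
          ["core-switch-1"]),
    Asset("db-server-1", "server", "HQ server room", "prod",
          ["san-1", "core-switch-1"]),
    Asset("customer-portal", "application", "HQ server room", "prod",
          ["db-server-1"]),
]


def recovery_order(assets):
    """Return asset names so that every dependency precedes its dependents."""
    graph = {a.name: set(a.depends_on) for a in assets}
    # TopologicalSorter raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


order = recovery_order(inventory)
print(order)
```

A circular dependency raises `graphlib.CycleError`, which is itself a useful finding during the inventory exercise.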
Once you have mapped out all your IT assets, networks, and their dependencies, go through each one and list the internal and external threats to it. Imagine the worst-case scenario, and be thorough. These threats could include natural disasters or mundane IT failures.
Next, estimate the probability that each event will happen and the impact it would likely have if it did. How would each scenario affect business continuity? This is also a good time to enlist the help of your colleagues. Just remember to emphasize that mundane events happen much more frequently than natural disasters. Move the conversation away from earthquakes and hurricanes and toward the higher probability that the location will experience a power outage or IT hardware failure. Here is an example:
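A simple way to compare threats is to multiply an estimated probability by an impact rating and sort by the result. The sketch below assumes a 1 to 5 impact scale and made-up annual probabilities; the `Threat` class, its fields, and every number are hypothetical placeholders for your own estimates.

```python
from dataclasses import dataclass


@dataclass
class Threat:
    """A threat to one asset or site. All values are illustrative estimates."""
    asset: str
    description: str
    annual_probability: float  # estimated, between 0.0 and 1.0
    impact: int                # assumed scale: 1 (minor) to 5 (severe)

    @property
    def risk_score(self) -> float:
        # Probability multiplied by impact: one common, simple scoring scheme.
        return self.annual_probability * self.impact


# Hypothetical entries; the numbers are not measurements.
threats = [
    Threat("db-server-1", "disk failure", 0.30, 4),
    Threat("HQ server room", "power outage", 0.50, 3),
    Threat("HQ server room", "earthquake", 0.01, 5),
]

# Highest risk first.
ranked = sorted(threats, key=lambda t: t.risk_score, reverse=True)
for t in ranked:
    print(f"{t.risk_score:.2f}  {t.asset}: {t.description}")
```

With these placeholder numbers the power outage and disk failure rank above the earthquake, even though the earthquake has the highest impact rating.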
Before you begin to build out your IT disaster recovery plan, you’ll need to classify your data and applications according to their criticality. Start by speaking to your colleagues and support staff to determine the criticality of each application and data set.
Look for commonalities and group them according to the criticality to your business continuity, frequency of change, and retention policy. You do not want to apply a different technique to every individual application or dataset that you have. Grouping your data into classes with similar characteristics will allow you to implement a less complex strategy to recover.
Classifying data in a vacuum based on assumptions may come back to haunt you. Be sure to involve other business managers and support staff in this planning exercise. You will undoubtedly have to make some trade-offs to limit the number of data classes you have. For medium-sized businesses, the number of classes should likely be between three and five. Here is an example:
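As a rough sketch of the grouping step, the snippet below buckets applications that share the same criticality, change frequency, and retention policy into one class. The application names, the trait labels, and the `group_into_classes` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical applications: (name, criticality, change frequency, retention).
applications = [
    ("ecommerce-db", "high", "continuous", "7 years"),
    ("payment-gateway", "high", "continuous", "7 years"),
    ("email", "medium", "daily", "1 year"),
    ("file-shares", "medium", "daily", "1 year"),
    ("legacy-hr-archive", "low", "rarely", "1 year"),
]


def group_into_classes(apps):
    """Group applications that share all three traits into one data class."""
    classes = defaultdict(list)
    for name, criticality, change_frequency, retention in apps:
        classes[(criticality, change_frequency, retention)].append(name)
    return dict(classes)


data_classes = group_into_classes(applications)
for traits, members in data_classes.items():
    print(traits, "->", members)
```

If the grouping produces many more classes than the three to five suggested above, that is a signal to revisit the trade-offs with the business owners and merge classes whose requirements are close.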
Different classes will have different recovery objectives. For instance, an ecommerce database may be a top priority to recover and have very aggressive recovery objectives because the business simply can’t afford to lose any transactions or be down for long. On the other hand, a legacy internal system may have less stringent recovery objectives and be a lower priority to recover since the data doesn’t change very often and it’s less critical to get back online.
This is the step where many IT professionals fall short. Setting recovery objectives without consulting the business line managers is the number one cause of misalignment. It’s imperative that you involve them in this process to ensure the business can recover properly during a disaster.
Here is a sample list of questions you can ask your business colleagues:
• What applications and data does your department use?
• What is your tolerance for downtime for each?
• What is your tolerance for data loss for each?
• Are there times when these applications are not being used by employees, partners, or customers?
• Would you ever need to restore data that is more than 90 days old? How about 6 months old? How about 1 year old?
• Are there any requirements (internal or external, such as industry or regulatory) for the organization to retain the data for a designated period of time?
• Are there any requirements (internal or external, such as industry or regulatory) that prevent us from moving the data from one geographical region to another?
• Are there any requirements (internal or external, such as industry or regulatory) with regard to security and encryption?
The key here is to understand business needs and provide a differentiated level of service availability based on priority. Now that you have that information in hand, it needs to be translated into recovery objectives to be included in your disaster plan.
Recovery time objective (RTO) — What is the acceptable time any of your data and production systems can be unavailable? This is your recovery time objective. To calculate the RTO for an application, consider how much revenue your organization would lose if the application went down for a given length of time. For example, how much would you lose if your customer portal went down for an hour, or a day? How much cost would be incurred if none of your employees can work because email is down?
Calculating your RTO is necessary for determining the features you need in your data protection systems and products. For example, if you have a very high RTO (say, more than four hours), you will probably have time to restore from tape, but if you have a very low RTO (such as just a few minutes), you need to use host-based replication or a disk-based backup with continuous data protection features.
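The downtime questions above reduce to simple arithmetic. The sketch below is one assumed cost model (lost revenue plus the cost of idle staff, both linear in time); the function names, parameters, and dollar figures are hypothetical, and real costs such as reputational damage or contractual penalties are not captured.

```python
def downtime_cost(hours, revenue_per_hour, idle_employees=0,
                  hourly_labor_cost=0.0):
    """Estimated cost of an outage under a simple linear model (assumption)."""
    return hours * (revenue_per_hour + idle_employees * hourly_labor_cost)


def max_rto_hours(loss_budget, revenue_per_hour, idle_employees=0,
                  hourly_labor_cost=0.0):
    """Longest outage, in hours, that stays within the loss budget."""
    hourly_cost = revenue_per_hour + idle_employees * hourly_labor_cost
    if hourly_cost <= 0:
        raise ValueError("hourly cost of downtime must be positive")
    return loss_budget / hourly_cost


# Hypothetical customer portal earning 5,000 per hour.
print(downtime_cost(1, 5000))    # cost of one hour down
print(downtime_cost(24, 5000))   # cost of one day down

# Hypothetical email outage idling 100 employees at 40 per hour each.
print(downtime_cost(2, 0, idle_employees=100, hourly_labor_cost=40.0))

# How long can the portal be down if the business will accept a 10,000 loss?
print(max_rto_hours(10000, 5000))
```

Working backward from an agreed loss budget, as `max_rto_hours` does, gives business managers a concrete number to react to when you negotiate the RTO for each data class.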
Recovery point objective (RPO) — What is the acceptable amount of data your organization can afford to lose? That is your recovery point objective. If your organization has a high tolerance for data loss, your recovery point objective (RPO) can be high, from hours to days. If your business can’t afford to lose any data, or very little, your RPO will be seconds.
The RPO you set will determine the minimum frequency for backing up your data. If you can only afford to lose an hour’s worth of data, you should back up the data at least every hour. That way, if an outage begins, for example, at 2:30 p.m., you can recover the 2:00 p.m. backup and meet the RPO requirement.
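The 2:30 p.m. example can be checked with a few lines of date arithmetic. This sketch assumes backups run on a fixed interval and complete instantly at their scheduled time, which real backups do not; the function names and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def schedule_meets_rpo(backup_interval, rpo):
    """Worst-case loss equals the interval, so it must not exceed the RPO."""
    return backup_interval <= rpo


def data_loss_window(outage, first_backup, backup_interval):
    """Time between the most recent scheduled backup and the outage."""
    if outage < first_backup:
        raise ValueError("outage precedes the first backup")
    completed = (outage - first_backup) // backup_interval
    last_backup = first_backup + completed * backup_interval
    return outage - last_backup


rpo = timedelta(hours=1)
interval = timedelta(hours=1)
first_backup = datetime(2024, 1, 1, 0, 0)  # hypothetical schedule start
outage = datetime(2024, 1, 1, 14, 30)      # outage begins at 2:30 p.m.

lost = data_loss_window(outage, first_backup, interval)
print(lost)                                # data written since 2:00 p.m.
print(schedule_meets_rpo(interval, rpo))
```

Because a backup takes time to finish, a practical schedule would usually run somewhat more often than the RPO alone suggests.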
Once you have identified all your IT assets, mapped their dependencies, and grouped them together based on their criticality and recovery objectives, it’s now time to choose what tools and techniques to use.
The good news is that a wide array of solutions is on the market today. Just make sure that what you choose offers the appropriate level of protection. Over-protection can cost the company needless money and introduce unnecessary complexity. (Complexity is the enemy of productivity and will likely increase the possibility for human error.) Under-protection can be equally bad since it will put your business continuity at risk.
For instance, nightly backups using traditional (file-based) methods are more than sufficient for low-impact data, but would be inappropriate for high-impact data and applications. A continuous data protection (CDP) solution is great for high-impact data and systems, but it can add overhead to production servers and increase storage costs.
Perhaps the most critical component of your backup and disaster recovery plan is offsite protection. This should be used regardless of the type of data backup method you choose. The method (be it a tape vaulting service or replication to the cloud) should be commensurate with your recovery objectives. Make sure your data is sent to a location that is far enough away that it is not in the same geographic risk zone. Typically, this is at least 25 miles away from the primary location.
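The 25-mile guideline can be sanity-checked from site coordinates with the haversine formula. The sketch below is only a distance check under stated assumptions: the coordinates are approximate, the site pairing is hypothetical, and straight-line distance is a proxy that says nothing about shared flood plains, fault lines, or power grids.

```python
import math

EARTH_RADIUS_MILES = 3958.8   # mean Earth radius
MIN_OFFSITE_MILES = 25        # guideline distance from the text above


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def far_enough(primary, offsite, minimum=MIN_OFFSITE_MILES):
    """True if the offsite location is at least `minimum` miles away."""
    return great_circle_miles(*primary, *offsite) >= minimum


# Hypothetical sites with approximate coordinates (latitude, longitude).
primary_site = (29.95, -90.07)   # roughly New Orleans
offsite_site = (30.45, -91.19)   # roughly Baton Rouge

distance = great_circle_miles(*primary_site, *offsite_site)
print(round(distance), far_enough(primary_site, offsite_site))
```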
Finally, automate and streamline the recovery process as much as you can. In the event of a disaster, key IT staff may be unavailable. Automation also lessens the risk of human error.
Go beyond the walls of the data center and involve key stakeholders from all your business units (i.e., application owners and business managers). They need to be involved in the planning phase, and they should agree with you on the company’s priorities and the service level agreements (SLAs) your team will provide.
Also, consult your strategic partners and vendors to make sure you’re getting the most out of your DR solution and/or services. When two servers failed at the Orleans Parish in New Orleans, causing the loss of critical conveyance and mortgage records dating back to the 1980s, IT staff hadn’t been keeping in close contact with the parish’s cloud backup / DRaaS provider. Similarly, when web hosting provider DreamHost had an outage, the company traced the source of the problem to the vendor that manages its data center. Don’t make the same mistake: stay in close contact with any vendor you employ.
Once you have consulted all of the key stakeholders, enlist an executive-level sponsor who will get behind you and the project. The importance of collaboration, consensus and executive support to your disaster plan’s success cannot be emphasized enough.
In a disaster scenario, you need a documented strategy for how to get back to a working state. This document should be written for the people who will use it.
Communicate your plan. All too often, only one person in the organization really knows the whole picture, leaving the organization vulnerable if that one person is unavailable during a disaster. In addition, be sure to store your DR strategy where it can be accessed during a disaster, not on a public share or in your Exchange folders, which may be unavailable when you need them. Ideally, it should be printed and posted in multiple locations.
People often say, “Practice makes perfect.” A better saying might be, “Practice makes progress.” No organization ever gets to perfection with its disaster plan, but practice will help you find and rectify problems in your plan, as well as enable you to execute it faster and more accurately. Make sure that everyone who has a role to play attends the practice sessions, even if you hold them, for example, on Saturdays.
You do not need to practice executing the full disaster recovery plan every time. It’s perfectly acceptable to carve out pieces of your plan to test. Here is an example:
A DR plan should be a living document. It’s especially important to regularly review your plan given the shifting sands of an ever-changing business environment. Tolerance for downtime and data loss may decline. Key personnel may go on leave or terminate their employment. IT might migrate to new hardware or operating systems. The company might acquire another company. Your planning needs to reflect the current state of the organization.