I have been fortunate to work for some very complex and mature IT departments throughout my career, and one of the more important practices I saw within these IT shops was a solid disaster recovery and business continuity strategy. As an IT business person who has participated on all sides of disaster recovery and business continuity (DR/BC) strategy, I want to share the five most frequently missed tasks when starting a DR/BC plan. Some of these components are common knowledge, but you would be surprised how many of my clients have not done this work.
1. Application Inventory and Dependency Mapping
It might seem obvious to start by identifying the business solutions that need to be recovered, but a list of individual applications does not show the entire picture. For example, if all your business solutions use Active Directory or some other Lightweight Directory Access Protocol (LDAP) provider within your network, then that provider not only needs to be on the list, it must also be linked to the apps that depend on it.
It would therefore be one of the first solutions that needs to be restored. Applications also need databases, application platforms, security platforms and monitoring solutions before you can get them up and running. Starting with the identified apps will help you see which infrastructure-type applications need to be added to the list and prioritized for DR.
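One way to turn a dependency map into a restore sequence is a topological sort, so that every system comes after the things it depends on. The sketch below is only an illustration: the system names are borrowed from the example table in this article, and the `dependencies` mapping and `restore_order` helper are hypothetical names, not part of any DR product.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Illustrative map: each key depends on every system in its set.
dependencies = {
    "VMware vSphere": set(),
    "Active Directory": {"VMware vSphere"},
    "Exchange Email": {"Active Directory", "VMware vSphere"},
    "SharePoint intranet": {"Active Directory", "VMware vSphere"},
}

def restore_order(deps):
    """Return systems ordered so each dependency precedes its dependents."""
    # static_order() raises graphlib.CycleError if the map contains a loop.
    return list(TopologicalSorter(deps).static_order())

order = restore_order(dependencies)
print(order)
```

A circular dependency raises an error instead of producing an order, which is a useful signal that the inventory itself needs another look.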
2. The Fabled SLA
Of the clients that I have worked with in the last five years (an estimated 2,500), 95 percent answered “no” to the following question: Have you identified, negotiated and documented your service-level agreements (SLAs) with your business users for the business solutions you are providing? The companies I have personally worked with range in size from 300 end users to 150,000 end users. All of the IT shops I worked in had negotiated, documented and published SLAs with realistic service level objectives. This one is not as tough as it seems.
First, I say “negotiated” because you have to come to an agreement with your end-user community on the effort and the cost it will take to restore services. Nearly every one of these negotiations starts with a conversation that goes something like this: “Mr./Ms. Business User, how long can this solution be down before we begin to impact our revenue?” The response is: “What do you mean ‘How long can I be down?’ It can’t be down at all. I need this solution to be always available. It can’t be down!” I then steer the conversation to the business at hand: determining how critical the solution really is to the business, so that we do not pay for an expensive recovery capability we don’t necessarily need.
SLAs can be very simple and still be effective. A simple spreadsheet with the following columns can be assembled quickly in most shops:
- App Name
- App Owner
- Recovery Point Objective (RPO)
- Recovery Time Objective (RTO)
- Application Dependencies
RPO is the point in time to which the data is restored. To illustrate, if a system goes down at 10 a.m. on a Thursday, how much data, in terms of time, can your business afford to lose? Is one hour OK? Can you get by without everything from today? Or are we good with our weekly backups? It’s OK for the answer to vary from one application to the next; the point here is that we need to know how to architect the backups or replications to accommodate the business requirements.
RTO is the amount of time the system can be down before we, as an organization, stop making money. Always up is obviously the most expensive, but you can offer a 15-minute RTO for a significant reduction in cost! Four hours, one day and longer are even more cost-effective.
With your SLAs documented, you will be able to easily map out your DR plan.
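To make the RPO idea concrete, here is a minimal sketch of one SLA row as a record, plus a check that a backup schedule can satisfy it. It assumes the worst-case data loss equals the interval between backups; the `Sla` class, its field names and `backup_meets_rpo` are hypothetical, and the sample values are only an example.

```python
from dataclasses import dataclass, field

@dataclass
class Sla:
    app_name: str
    app_owner: str
    rpo_hours: float     # maximum tolerable data loss, in hours
    rto_minutes: float   # maximum tolerable downtime, in minutes
    dependencies: list = field(default_factory=list)

def backup_meets_rpo(sla, backup_interval_hours):
    """Assumes worst-case data loss equals the time between backups."""
    return backup_interval_hours <= sla.rpo_hours

exchange = Sla("Exchange Email", "App team", rpo_hours=12, rto_minutes=15,
               dependencies=["Active Directory", "vSphere"])

nightly_ok = backup_meets_rpo(exchange, 24)  # nightly backup vs. 12-hour RPO
hourly_ok = backup_meets_rpo(exchange, 1)    # hourly backup or replication
print(nightly_ok, hourly_ok)  # prints: False True
```

Under this assumption an RPO of zero can never be met by periodic backups, which is why those systems end up on continuous replication or redundant architectures.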
3. Identifying Standard DR Architectures Based on SLA Requirements
Often, my clients just assume that DR/BC is an expensive investment that may never realize any returns. But DR/BC does not automatically mean you have to buy double the infrastructure. Once you have your SLAs defined for each system that needs to be recovered, simply sort by criticality and recovery time objective. The groups will become evident and the tiers can be identified easily. You should only need three to four tiers. This document will look something like this:
| App Name | App Owner | Criticality Factor | RTO (mins) | RPO (hrs) | Dependencies | DR Tier |
|---|---|---|---|---|---|---|
| Internet connection | Security team | 1 | 0 | 0 | Hardware, service provider | 1 |
| VMware vSphere | Server team | 1 | 0 | 0 | Hardware | 1 |
| Active Directory | Server team | 1 | 0 | 0 | vSphere | 1 |
| Centrify | App team | 1 | 0 | 0 | Internet connection | 1 |
| Exchange Email | App team | 1 | 15 | 12 | Active Directory/vSphere | 2 |
| SharePoint team sites – DR procedures | App team | 1 | 15 | 0 | Active Directory/vSphere | 2 |
| SharePoint intranet | App team | 2 | 240 | 12 | Active Directory/vSphere | 2 |
| ADP (Cloud) | HR | 4 | 720 | 24 | Active Directory/internet | 3 |

1 = Mission Critical, 5 = Restore from Backup
Tier 1 = Always On/Redundant Systems
Here we have classified the systems with a 240-minute RTO as Tier 2 alongside the 15-minute systems, so they will actually be architected to achieve a 15-minute RTO. With these numbers, we will know whether we need an Always On architecture, Active/Replicate (SRM), Backup/Restore, etc.
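The sort-and-group step can be sketched in a few lines. The cut-offs in `dr_tier` below are assumptions chosen to reproduce the example table (an RTO of 0 is Tier 1, up to 240 minutes is Tier 2, anything longer is Tier 3); your own tiers should come from where your SLAs naturally cluster.

```python
def dr_tier(rto_minutes):
    """Map an RTO to a DR tier using assumed, illustrative cut-offs."""
    if rto_minutes == 0:
        return 1   # Always On / redundant systems
    if rto_minutes <= 240:
        return 2   # assumed: replicated, built for the tier's tightest RTO
    return 3       # assumed: backup and restore

# (app name, criticality factor, RTO in minutes) from the example table
apps = [
    ("ADP (Cloud)", 4, 720),
    ("SharePoint intranet", 2, 240),
    ("Active Directory", 1, 0),
    ("Exchange Email", 1, 15),
]

# Sort by criticality first, then by RTO, so the groups become visible.
apps.sort(key=lambda app: (app[1], app[2]))
tiers = {name: dr_tier(rto) for name, _criticality, rto in apps}

for name, _criticality, rto in apps:
    print(f"{name}: RTO {rto} min, tier {tiers[name]}")
```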
4. Final Cost Review
One of the gaps in an IT department’s approach to DR/BC is measurement: a measurement of success, a measurement of business requirement satisfaction, and a final cost analysis to make sure the investment is required and matches the risk that we are trying to mitigate. Larger companies will have a business analyst role, a person who has been trained to do this kind of work. But for most companies, the role of business analyst is typically performed by an IT manager or director.
To simplify this step, report and review the cost of each solution’s DR plan with the appropriate stakeholders or line-of-business leaders before the investment is made. Doing so assists with buy-in and with understanding what is required to provide this type of service within the entire organization. This step also helps the IT team get a seat at the business strategy table!
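A back-of-the-envelope comparison is often enough to start that conversation. The sketch below weighs the yearly cost of a DR solution against the expected revenue loss it avoids. Every figure in it is hypothetical, and the simple model (outage frequency times hours down times revenue lost per hour) is an assumption, not a formal risk methodology.

```python
def expected_annual_loss(outages_per_year, hours_down_per_outage,
                         revenue_loss_per_hour):
    """Simplified model: frequency x duration x hourly revenue impact."""
    return outages_per_year * hours_down_per_outage * revenue_loss_per_hour

def dr_investment_justified(annual_dr_cost, loss_without_dr, loss_with_dr):
    """True when the yearly reduction in expected loss covers the DR cost."""
    return (loss_without_dr - loss_with_dr) >= annual_dr_cost

# Hypothetical numbers: one outage every two years, $10,000 lost per hour.
without_dr = expected_annual_loss(0.5, 24, 10_000)   # restore takes a day
with_dr = expected_annual_loss(0.5, 0.25, 10_000)    # 15-minute RTO
justified = dr_investment_justified(60_000, without_dr, with_dr)
print(without_dr, with_dr, justified)  # prints: 120000.0 1250.0 True
```

Running the same comparison per tier shows stakeholders which solutions justify a tighter RTO and which are fine with backup and restore.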
5. Building a DR/BC Test Plan
Yes, you have to test this investment. Whether you identify it as the last phase of the project or you schedule resources to be separated from a normal day’s work, an investment in an “insurance solution” is worthless unless you can prove that it works. Even more important, you need to test the documented procedures several times. The most mature shops will test twice a year. Companies that have purposely invested in DR test at least annually.
During the first test, the authors of the procedures must verify their instructions are correct. However, the next test needs to prove the procedures work for any one of your team members. Therefore, it’s best to use your newest IT employees to test the procedures and verify they work. We once negotiated with the IT department of another company to lend us their employees, who were completely unfamiliar with our systems, procedures or culture, to perform our DR test, and in return, we would send resources for their DR test.
It might be considered torture, but you also need to test your business continuity procedures because you need to understand how your business will remain open and profitable in the event of an outage. And you should use the business users to test in a live business environment.
These are just the beginnings. We have not gone through the entire process, just the items that get you to a successful plan quicker. If you would like to talk to someone who has developed and deployed several DR/BC plans, contact your local CDW account and ask for a consultation with the CAS team.