Disaster recovery (DR) is a constant conversation with customers. At the Microsoft Ignite conference this year, Manoj Kumar Jain (Principal Program Manager) referenced a few statistics on disaster recovery: Most organizations experience four-plus disruptions each year and the average cost of disruption is $1.5 million an hour. With the increasing popularity of public clouds, the question arises as to how to take advantage of them for disaster recovery. Let’s discuss the key elements of a DR solution, whether using a public cloud for disaster recovery can help, as well as things you need to consider.
Disaster recovery means determining which applications and data must be available during an emergency for your business to survive, and then building a solution around them. First, it is important to understand the difference between surviving and being fully operational. The intent of disaster recovery is to put together and execute a plan that lets the business keep running, not to run in business-as-usual mode. According to the statistics Manoj referenced, 40 percent of businesses do not reopen after a major disaster; DR exists to ensure you are in the other 60 percent.
Performing a VM or SAN copy is not a disaster recovery solution – it is a backup solution. Disaster recovery is intended to get your organization up and functioning during a disaster, so straight replication may or may not accomplish that. Planning for a disaster and then knowing how to implement that plan is part of an overarching business continuity strategy. If you do not know what the business needs in an emergency, it is nearly impossible to plan for that emergency. Disaster recovery is just one part of the Business Continuity Plan. Other factors are important, such as where people are to go during an emergency, who has authority during a crisis (Business Response Team), what risks the Business Impact Analysis identifies, and what the Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) are. The disaster recovery plan is then a result of that business continuity strategy. For the purposes of this post, let’s assume the Business Continuity Plan has already been developed.
One outcome that the Business Continuity Plan achieves is determining what applications are needed during an emergency – and which ones are superfluous. Usually this is done in a tiered approach, giving a tier level of importance to applications. This information is crucial in determining what virtual machines to replicate to the cloud.
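As a minimal sketch of how a tiered inventory might drive the replication list, the snippet below filters applications by tier. The `App` fields, the convention that tier 1 is the most critical, and the sample inventory are all assumptions made for illustration; a real Business Continuity Plan would supply its own tiers and application names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    tier: int          # assumed convention: 1 = most critical
    virtualized: bool  # False for physical or mainframe workloads


def select_for_replication(apps, max_tier):
    """Return names of virtualized apps whose tier is within the cutoff,
    ordered by tier and then by name."""
    ordered = sorted(apps, key=lambda a: (a.tier, a.name))
    return [a.name for a in ordered if a.tier <= max_tier and a.virtualized]


# Hypothetical inventory, not taken from any real plan.
inventory = [
    App("payroll", 1, True),
    App("order-db", 1, True),
    App("intranet-wiki", 3, True),
    App("legacy-billing", 1, False),  # critical but physical: needs separate handling
    App("reporting", 2, True),
]

print(select_for_replication(inventory, max_tier=1))  # ['order-db', 'payroll']
```

Note that `legacy-billing` is tier 1 but is excluded because it is not virtualized. Critical physical servers like this fall outside a simple VM replication list and have to be planned for separately.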
The diagram below shows the areas that need to be considered. In each area, it is assumed that the applications being replicated are only those the Business Continuity Plan defines as necessary.
It will become obvious that using a public cloud as a total DR solution is most likely not possible, unless your organization is completely virtualized or virtualized enough that the non-virtual applications are not needed during a disaster. For some small and medium-sized businesses, that may well be the case. But for a public cloud to be effective in this scenario, the vendor must have tools or services that are specifically designed for disaster recovery. Questions to ask about how a vendor handles your VMs include:
- Can they replicate your VMs to their cloud, and can they isolate the replicas from your production VMs?
- Can they replicate back to your data center afterwards?
- Do they have an orchestration tool or process, like a RunBook, to organize your VMs and monitor their replication?
- Is there a defined process to test the solution?
For a cloud to be a viable DR solution, these questions need to be answered. A constant issue with public cloud as a disaster recovery solution is accounting for physical servers. Can you replicate them to the cloud in a physical to virtual (P2V) manner? And if you can, do you want to? I believe the two biggest blockers for using a public cloud as a complete disaster recovery solution are the physical servers that are critical to the business and the legacy mainframe applications.
There is no reason that the public cloud cannot be used to augment your current overarching solution. As the vendors add more services, it will become easier to adapt. There are ways to use the cloud with physical servers – if you can be creative. Looking at the diagram above, replicating data and getting users access to it is the key to a successful plan. Of all the components in the diagram, there are a few things I want to point out.
First, what is not directly shown in the reference architecture is the need for failback once the emergency is over and there is a desire to return to normal business processes. In that case, any changes made to application data need to be replicated back. This is implied in the above architecture by the double arrows, with the arrow pointing to the on-premises environment as the failback pathway. With VMs, for instance, most top tier vendors will have a process that will allow for reversing the replication back from the cloud to the customer’s premises. Part of that solution needs to incorporate any databases, SANs or applications that are not part of the VM environment.
You can look at the numbers and their descriptions in the diagram and get a good deal of information as to their purpose. I want to point out a few key items that, in my opinion, are critical to any DR solution: the RunBook, DNS and User Access.
- RunBook: This is probably the most important element of the whole disaster recovery strategy. It determines the difference between just a copy of data and a disaster recovery plan. The RunBook can be physical or digital. It can be manual or automated, but it must exist. It describes what needs to be brought up and when during an emergency, in what order and what is dependent upon what. You might not want to bring up a server that connects to a database automatically until the database is up and running – or you might not want to bring up an application server unless the middleware server is available. This can (and should) be automated. Whether it is or not, it is imperative that one exists.
- DNS: Critical to an emergency is getting access to the disaster recovery environment. It will be difficult for users to remember to go to a “special place” while fires are burning, rivers are flooding or traffic is snarled. Keeping the business running is imperative. Having a plan for global DNS to re-route to a different data center during such times is essential to a smooth transition.
- User Access: This is often the most forgotten aspect of a DR plan, yet it is one of the most important. Everyone is worried about how to get servers running, how to get the data they need and how to make the systems available to users. However, they forget that during an emergency, connecting to a corporate VPN, getting onsite or getting to any site may not be possible. The disaster recovery plan should assume the user can only access the internet and that the corporate network is not available. Providing end users access to the new environment should be part of disaster recovery testing before the actual event occurs.
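The RunBook's ordering requirement (do not start a server until what it depends on is running) can be sketched as a dependency graph sorted topologically. This is only an illustration using Python's standard `graphlib` module (Python 3.9+), not any vendor's orchestration tool; the service names and the `startup_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter, CycleError


def startup_order(depends_on):
    """Return a start order in which every service comes after the
    services it depends on.

    depends_on maps each service name to the set of services that must
    already be running. Raises CycleError on circular dependencies.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical RunBook: each entry lists what must be up first.
runbook = {
    "database": set(),
    "middleware": {"database"},
    "app-server": {"middleware", "database"},
    "web-frontend": {"app-server"},
}

print(startup_order(runbook))
# ['database', 'middleware', 'app-server', 'web-frontend']
```

A circular dependency raises `CycleError` instead of producing an order, which is worth discovering during DR testing and not during the emergency itself.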
Disaster recovery can be a complicated solution to implement, especially if applications have cross-dependencies with other departments or agencies, depend on a connection to a mainframe or have other requirements. Can a public cloud offer a complete solution? Yes, but not necessarily for every business, because every business is different and every executive has a different definition of what business continuity is. The cloud can offer a viable alternative to expensive proprietary solutions and can certainly augment or even enhance your current solution.
In a future post, I will discuss specifics on how a disaster recovery strategy could be implemented with specific vendor clouds, but in the meantime, check out this disaster preparedness white paper or BizTech Magazine’s collection of DR-related articles.