Expect the unexpected with a solid cloud DR strategy
The cloud might be the most overhyped technology in decades, but it can be extremely beneficial when it's used as part of an organization's disaster recovery (DR) plan. It's now possible for a company to create a cloud-based recovery site that can be used if the primary data center is incapacitated. There are a number of alternative strategies for using the cloud for disaster recovery and disaster prevention; we'll describe the key considerations in detail.
1. Evaluate your data protection needs
The first step in implementing any DR offering is to evaluate your organization's needs. While this might sound simple (and it can be), the results of your evaluation will be the major factors that determine the infrastructure and configuration you'll have to put in place to facilitate cloud-based data protection. For example, some organizations use cloud storage as part of a disk-to-disk-to-cloud backup solution. The primary backups remain onsite, but they're replicated to cloud storage where they'll be protected from events that could disable a data center, such as a fire or flood. Other firms replicate entire virtual machines (VMs) to the cloud so they can be spun up and hosted there if it isn't possible to continue hosting the VMs in the local data center.
2. Choose your cloud provider
Once you've determined your data protection needs, the next step is to identify cloud providers that can accommodate those needs. Not every cloud provider is equipped to handle every situation. For example, some providers will allow you to replicate a VM, but won't host it. Similarly, there are providers that offer storage, but little else. If your goal is to build a cloud-based disaster recovery site, you'll need to find a cloud service provider that offers the specific capabilities you need.
On the other hand, if your only goal is to replicate data to the cloud, you might be better off subscribing to a storage-only plan to avoid paying for services and capabilities that you don't need.
Whatever your needs are, it's a good idea to identify several different cloud service providers so you can compare costs and levels of service reliability.
3. Estimate the costs
Once a viable cloud provider has been identified, you'll need to estimate the cost of using cloud-based disaster recovery. Each cloud service provider typically has a unique pricing model, but the total monthly cost usually comprises some combination of the following factors:
- A monthly subscription fee
- The amount of Internet bandwidth used
- The amount of storage space consumed
- The number of VMs (or virtual processors)
Some providers bundle a usage allowance into the subscription fee. For instance, the monthly subscription fee might include a specific amount of bandwidth, with any usage beyond that level billed separately.
It's also a good idea to check the provider's policy toward VMs that aren't powered on. Some providers charge based on the number and type of VMs created, regardless of whether those VMs are powered on. Other providers charge only for actual usage and therefore offer a billing structure based on the number of minutes or hours for which a VM is powered on.
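To see how these billing factors interact, it can help to sketch them as a simple monthly estimate. The rates, the bandwidth allowance and the usage figures below are made-up placeholder numbers for illustration, not any provider's actual pricing.

```python
def estimate_monthly_cost(
    subscription_fee,   # flat monthly fee (includes a bandwidth allowance)
    included_gb,        # bandwidth allowance bundled with the subscription
    bandwidth_gb,       # total bandwidth actually consumed
    storage_gb,         # storage space consumed
    per_gb_overage,     # charge per GB of bandwidth beyond the allowance
    per_gb_storage,     # charge per GB of storage
    vm_hours,           # total powered-on VM hours for the month
    per_vm_hour,        # charge per VM-hour (for usage-based billing)
):
    """Toy cost model: subscription + bandwidth overage + storage + VM usage."""
    overage_gb = max(0, bandwidth_gb - included_gb)
    return (subscription_fee
            + overage_gb * per_gb_overage
            + storage_gb * per_gb_storage
            + vm_hours * per_vm_hour)

# Hypothetical scenario: $100 base fee with 500 GB of bandwidth included,
# 750 GB actually used, 2 TB stored at $0.05/GB, and one warm-standby VM
# powered on for 100 hours at $0.10/hour.
cost = estimate_monthly_cost(100.0, 500, 750, 2000, 0.12, 0.05, 100, 0.10)
```

Even a rough model like this makes it easier to compare providers whose pricing emphasizes different factors, such as storage-heavy versus compute-heavy billing.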
Disaster recovery: A new app for the cloud
Cloud-based disaster recovery (DR) usually means replicating data or even entire virtual machines to the cloud. However, for companies that already have a secondary data center for DR in place, it might make more sense to use cloud services as a mechanism for facilitating the DR process rather than using the cloud as a data repository. Microsoft Corp., for example, has introduced the Hyper-V Recovery Manager on Windows Azure.
The Hyper-V Recovery Manager is a hybrid service that allows you to use Windows Azure to manage the replication process between your primary and secondary data centers.
Hyper-V Recovery Manager is intended to replace storage vendors' proprietary SAN-to-SAN replication. Instead, replication is performed at the hypervisor level using the native Hyper-V 3.0 replica feature. Although the virtual machine replication process occurs between data centers, Windows Azure is used as a cloud-based solution for managing the replication, failover and DR testing process. Administrators are able to define a series of recovery plans within the Windows Azure interface, and use the interface to perform failovers and failover testing. Azure communicates with on-premises System Center Virtual Machine Manager (VMM) server deployments (one in each data center). VMM, in turn, performs the heavy lifting by instructing individual host servers to replicate data, perform a failover and so on. This can be an especially effective solution for organizations that want to take some of the cost and complexity out of failovers to an alternate data center.
The way a cloud provider bills for services can have a major impact on the cost of your cloud-based disaster recovery initiative. That's why it's so important to develop accurate cost estimates before signing on with a provider. Although some providers use very complicated billing formulas, there are tools available to help you estimate the costs. For example, Microsoft Corp. offers a cost estimate tool for Windows Azure. Similarly, Amazon Web Services offers a cost calculator. Also, some third-party tools have built-in cost estimators. Veeam Backup and Replication Cloud Edition, for example, has its own cost-estimating calculator that works with a number of different cloud providers.
4. Develop a bandwidth management strategy
Another important step in taking your disaster recovery initiatives to the cloud is to come up with a strategy for managing Internet bandwidth. Bandwidth management is extremely important for a number of reasons, including:
- Many cloud service providers charge for bandwidth consumption.
- Your own Internet service provider may impose monthly usage caps or may charge for excess bandwidth usage.
- You must provide adequate bandwidth to allow data to be backed up (or replicated) in a timely manner.
- You must ensure that your cloud backups or replicated data don't consume so much bandwidth that other Internet usage suffers from inadequate bandwidth availability.
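The "timely manner" requirement above can be sanity-checked with simple arithmetic: divide the amount of data to be copied by the backup window to get the minimum sustained throughput needed. The 500 GB change set and eight-hour window below are hypothetical numbers.

```python
def required_mbps(data_gb, window_hours):
    """Minimum sustained throughput (Mbit/s) needed to move
    data_gb within window_hours. Uses decimal (SI) gigabytes."""
    bits = data_gb * 8 * 1000**3        # GB -> bits
    seconds = window_hours * 3600
    return bits / seconds / 1e6         # bits/sec -> Mbit/s

# Example: replicating a 500 GB nightly change set inside an
# 8-hour window needs roughly 139 Mbit/s of sustained upstream bandwidth.
rate = required_mbps(500, 8)
```

If the result exceeds what your Internet connection can realistically sustain (after deduplication and compression), the backup window, the change rate or the connection itself has to change.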
The method you'll use for bandwidth management depends on the approach you're using for copying data to the cloud. Some backup software and many cloud storage gateway appliances include built-in bandwidth scheduling features. Such features generally allow you to limit the overall bandwidth that the data-copying app consumes and may let you increase that limit during off-peak hours.
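A bandwidth-scheduling feature of the kind described above usually boils down to mapping time of day to a throughput cap. The hours and limits in this sketch are illustrative assumptions, not a recommendation.

```python
def backup_bandwidth_cap_mbps(hour):
    """Return the bandwidth cap (Mbit/s) for the backup job at a
    given hour (0-23). Business hours are throttled so interactive
    Internet use isn't starved; off-peak hours get a higher limit.
    The specific numbers here are placeholders."""
    if 7 <= hour < 19:    # business hours: heavy throttling
        return 20
    return 200            # nights: let the backup run near line rate
```

Backup software with this capability typically applies the cap on the client side, so no QoS changes are needed on the network gear for the schedule to take effect.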
Similarly, many organizations use quality of service (QoS) to reserve bandwidth for cloud backups and other bandwidth-intensive services. This ensures that each Internet-based service receives the bandwidth it needs without consuming an excessive amount of the available bandwidth.
Regardless of the amount of bandwidth you reserve for your cloud-based backup or replication service, it's important to make efficient use of that bandwidth by also applying data deduplication.
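At its simplest, deduplication means hashing chunks of data and transferring only the chunks the cloud side hasn't already stored. The fixed-size-block sketch below illustrates the idea; real products typically use variable-size chunking, compression and an index maintained on the target.

```python
import hashlib

def chunks_to_upload(data, seen_hashes, chunk_size=4096):
    """Split data into fixed-size chunks and return only the chunks
    whose SHA-256 digest isn't already known remotely, recording new
    digests in seen_hashes as we go."""
    new_chunks = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            new_chunks.append(chunk)
    return new_chunks

# A backup stream containing repeated blocks transfers far fewer chunks:
seen = set()
first = chunks_to_upload(b"A" * 8192 + b"B" * 4096, seen)   # 2 unique chunks
second = chunks_to_upload(b"A" * 4096 + b"C" * 4096, seen)  # only "C" is new
```

The second backup run only uploads the one chunk it hasn't seen before, which is exactly the effect that makes deduplicated cloud backups viable over constrained WAN links.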
5. Determine the logistical requirements
If your company is using the cloud solely for its storage capabilities, only a minimal amount of logistical planning is required. However, organizations that wish to perform full-blown failovers to the cloud will need to take a number of considerations into account.
The actual logistics that must be planned can vary considerably depending upon your company's existing infrastructure, the cloud service being used and the desired end result. Even so, there are some aspects of the logistical planning process that are especially common.
The first consideration relates to how you will copy data from your on-premises data center resources to the cloud service. If you're using a public cloud, the replication process will most likely be software based. In any case, you'll have to use a replication mechanism that's supported by the cloud provider and also compatible with the on-premises resources you plan to replicate.
Another important consideration is Active Directory (AD) synchronization. Windows-based clustering solutions require cluster nodes to be members of a common Active Directory domain. This same concept also holds true for many other Microsoft fault-tolerant offerings (such as Exchange Server database availability groups) because the technologies are built on top of failover clustering components. That means that if you want to extend a cluster to the cloud for DR purposes, you will most likely need to extend an on-premises AD domain to the cloud.
The actual method used to accomplish this can vary from one cloud provider to the next. In the case of Windows Azure, Microsoft provides a cloud-based directory service—called the Windows Azure Active Directory—that can be synchronized to an on-premises Active Directory and an on-premises DNS server. The biggest trick to making the synchronization work is that the on-premises network and the virtual network that exists within the Windows Azure cloud must be able to communicate with one another. The easiest way to facilitate this communication is by deploying an on-premises Routing and Remote Access Server (or a VPN server) and configuring Windows Azure to attach to your on-premises network through the VPN.
Another big issue you'll have to address is that of the cluster maintaining quorum. Windows failover clusters are based on the Majority Node Set model. This means that for a cluster to maintain quorum, a majority (more than half) of the cluster nodes must remain online. For example, if a cluster has seven nodes, at least four nodes must remain functional for the cluster to continue operating.
The problem with this concept is that Windows can't tell the difference between a multinode failure and a WAN link failure. Placing the majority of the nodes on the premises would prevent a cloud failover from occurring in the event of a WAN link failure, but it wouldn't protect you against a data center-level failure because an insufficient number of nodes would exist in the cloud. Similarly, if the bulk of the nodes existed in the cloud, then a WAN link failure would trigger a failover to the cloud. One common solution to this problem is to place an equal number of nodes on the premises and in the cloud and to then place a "tie breaker" node in a third location. Doing so ensures that one of the two sites will always be able to maintain quorum, and reduces the chances of link-related failovers. It's important to make sure that all three locations can communicate with one another at a speed that adheres to Microsoft's latency requirements for clustering.
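The quorum arithmetic behind the tie-breaker arrangement can be sketched in a few lines. The seven-node layout below matches the example above: three nodes on premises, three in the cloud and one tie-breaker at a third site.

```python
def has_quorum(online_nodes, total_nodes):
    """Majority Node Set: more than half of all configured
    nodes must be reachable for the cluster to keep quorum."""
    return online_nodes > total_nodes // 2

TOTAL = 7  # 3 on-premises + 3 cloud + 1 tie-breaker

# WAN link between the two main sites fails, but each can still reach
# the tie-breaker site. Whichever site wins the tie-breaker's vote has
# 3 + 1 = 4 nodes, which is a majority, so the cluster stays up there.
surviving_site = 3 + 1   # local nodes plus the tie-breaker
losing_site = 3          # local nodes only
```

Without the third site, a 4/3 split in either direction would either block cloud failover during a WAN outage or trigger one unnecessarily, which is exactly the failure mode the tie-breaker node eliminates.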
6. Plan for virtual machine replication
Clustering isn't appropriate for every situation, and not every application or virtual server can be clustered. An alternative technique for using the cloud for disaster recovery is to simply replicate VMs to the cloud. Organizations that use this approach must determine what they hope to gain from the replication process. For example, VM-level replication can provide the following benefits:
- Point-in-time-image-based recovery
- The ability to mount a cloud-based copy of a VM and extract data
- The ability to redirect users to a cloud-based VM replica in the event of an on-premises failure
If the goal is to redirect users to a cloud-based VM in the event of a failure, then the biggest challenges you'll face are related to IP address injection and DNS record modifications. To be usable, the VMs will need IP addresses that are local to the cloud-based virtual network subnet on which they will reside during a failover. DNS record modifications are required so the virtual server can be found when it's running in the cloud.
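The DNS side of that redirection amounts to repointing the server's record at an address on the cloud-based subnet. The sketch below models a zone as an in-memory table to show the shape of the change; a real failover would drive the DNS provider's update API or dynamic DNS, and the host name and addresses are hypothetical.

```python
def failover_dns(records, host, cloud_ip, ttl=60):
    """Repoint a host's A record at its cloud replica's address.

    `records` is an in-memory stand-in for a DNS zone: {name: (ip, ttl)}.
    A deliberately short TTL on the failover record means clients stop
    caching the dead on-premises address quickly."""
    records[host] = (cloud_ip, ttl)
    return records

# Normal state: the app resolves to an on-premises address with a long TTL.
zone = {"app.example.com": ("192.0.2.10", 3600)}

# After a site failure, repoint it at the replica's cloud-subnet address.
failover_dns(zone, "app.example.com", "203.0.113.25")
```

Pre-staging a low TTL on records you expect to fail over is a common precaution, since a cached long-TTL record can keep clients pointed at the failed site for hours after the DNS change is made.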
The major hypervisor vendors offer features for performing these tasks and redirecting the user workload, but some third-party backup vendors offer similar capabilities that can be used without admins having to configure IP address injections and DNS modifications.
About the author:
Brien Posey is a Microsoft MVP with two decades of IT experience. Before becoming a freelance technical writer, Brien worked as a CIO for a national chain of hospitals and health care facilities. He has also served as a network administrator for some of the nation's largest insurance companies and for the Department of Defense at Fort Knox.