Recovery, what?

As I work from home more than ever now, it is becoming blatantly evident that there are some processes I need to get in order. Am I the only one suddenly realizing that my daily activities are scattered all over the place? I mean, my notes are taken by hand and in Apple Notes, GoodNotes, Evernote, and elsewhere. My files are stored locally, in OneDrive, Dropbox, Google Drive, and various other locations.

My unorganized workflow got me thinking. I have to get this cleaned up, and quickly! What happens if my computer crashes? If I need to recover a file, a picture, or some critical data, where is it located? Where is it backed up? Is it backed up at all? Do I have the software to restore it? Is there a specific order in which I need to recover? Now, I do have a backup to the cloud, so I should be fine. Well, if that were the case, why would there be so many applications and services for backing up Dropbox, Drive, Box, and the rest? Just take a look at my friend’s most recent post at VirtuallyGeeky.

My debacle led me to think about common problems in the data center. How does my mess correlate to what I see in the data center today? When I speak with customers about backup and recovery, they face many of the same problems and must work through the same pattern of thought. Data is written in multiple locations and backed up in numerous different ways, and complete data center recovery is not an easy subject to navigate. And in some cases, the customer has no idea whether a particular application or piece of data is recoverable at all. What really fuels the conversation is being asked, “What is your current RPO?” or “What is your current RTO?”

Breaking down RPO and RTO.

TL;DR:
RTO is how long it will take you to return services to your end-users. RPO is how much data you can afford to lose. An example of RPO: if you take a snapshot every 60 minutes, with your last snapshot taken at 1:00 PM, and then have an outage at 1:45 PM, you will have lost 45 minutes of data. Your RPO in this instance is one hour, because at most you could lose up to 60 minutes of data.
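The arithmetic in that example can be spelled out in a few lines of Python (the date is arbitrary, chosen only to reproduce the 1:00 PM snapshot and 1:45 PM outage scenario):

```python
from datetime import datetime, timedelta

# Worked example from above: hourly snapshots, last taken at 1:00 PM,
# outage at 1:45 PM.
snapshot_interval = timedelta(minutes=60)      # this interval IS your RPO
last_snapshot = datetime(2020, 5, 1, 13, 0)
outage = datetime(2020, 5, 1, 13, 45)

actual_loss = outage - last_snapshot           # data written since the snapshot
assert actual_loss <= snapshot_interval        # loss can never exceed the RPO

print(f"Lost {actual_loss} of data; RPO is {snapshot_interval}")
# prints: Lost 0:45:00 of data; RPO is 1:00:00
```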

RTO: Recovery Time Objective

“RTO refers to how much time an application can be down without causing significant damage to the business. Some applications can be down for days without significant consequences. Some high priority applications can only be down for a few seconds without incurring employee irritation, customer anger, and lost business. RTO is not simply the duration of time between loss and recovery. The objective also accounts for the steps IT must take to restore the application and its data. If IT has invested in failover services for high priority applications, then they can safely express RTO in seconds. (IT must still restore the on-premises environment. But since the application is processing in the cloud, IT can take the time it needs.)”

https://www.enterprisestorageforum.com/storage-management/rpo-and-rto-understanding-the-differences.html

RPO: Recovery Point Objective

“Recovery point objectives refer to your company’s loss tolerance: the amount of data that can be lost before significant harm to the business occurs. The objective is expressed as a time measurement from the loss event to the most recent preceding backup. If you back up all or most of your data in regularly scheduled 24-hour increments, then in the worst-case scenario, you will lose 24 hours’ worth of data. For some applications, this is acceptable. For others, it is not.”

https://www.enterprisestorageforum.com/storage-management/rpo-and-rto-understanding-the-differences.html

Let’s solve the problem!

Now, I’d be more than excited to blog about how I mastered my workflow problems, as that is a much simpler problem to solve. But my focus here is on backup and recovery of applications and data amid an ever-changing world of new workloads deployed to end-users. I struggled with these complexities personally, and when I ran a large data center, I ultimately failed for the most part. Failed is a strong word, and I use it because it is honest. Looking back at the number of hours it took to recover from a partial failure leads me to believe that if a complete crash had occurred, it would have been nearly impossible to restore all data and services in a timely fashion. Would everything eventually have been restored? Absolutely. However, due to the lengthy time to recover, the business would have suffered. The result? A failed plan for recovery, or RTO.

Are you struggling with these same issues? If so, you are not alone. I can say with confidence: many customers are not doing enough to ensure disaster recovery in a timely fashion. From my experience, the problem we face is that everything we do in the data center moves so quickly that it is challenging to keep up. Think about my previous blog, https://nutanixed.com/we-dont-do-cloud/, and how the complexity of the data center is prohibitive. If we do not get ahead of these complexities, we get buried in keeping the lights on, leaving little or no time to focus on recovery. Are you adding people to your team to focus 100% of their time on disaster recovery: planning, documenting, and testing? Probably not. Am I right?

Now, I am not going to say that HCI in itself solves our problems. It just doesn’t. Yes, consolidating compute and storage does simplify the data center, but it does not provide any inherent value directly to disaster recovery. That does not mean there is no benefit to HCI when it comes to disaster recovery; it means that other tools, or feature sets, need to go alongside HCI to reveal the benefits. The Nutanix platform takes data protection into consideration in its storage design. Not all HCI platforms do the same; others require additional toolsets or a third party to accomplish what Nutanix can do natively.

Let me pause and state that this blog has no focus on backup or long-term data recovery. The discussion is all about disaster recovery: the complete failover from data center A to data center B.

When considering how a customer is going to fail over from one data center to another, we must answer the questions, “How fast do you need to recover after an event?” and “What do you need to recover quickly?” The answers are easy. We need to recover really fast! Seriously, though, it depends on the application and the data. Mission-critical applications must be restored to service almost immediately, whereas non-critical applications may not need to recover for several hours, or at all in some cases. What needs to be restored, and how quickly, should be identified early in your recovery plan, allowing you to better design the restoration of services. It also enables you to weigh the cost of implementing the proper disaster recovery solution.
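As a sketch of what that early identification might look like, here is a toy tiering exercise. The application names and the RTO/RPO targets below are invented purely for illustration, not a recommendation:

```python
# Illustrative only: application names and tier targets are made up to show
# how RTO/RPO targets can drive the recovery order in a plan.
apps = {
    "erp-database":  {"tier": "mission-critical",  "rto_minutes": 5,   "rpo_minutes": 15},
    "email":         {"tier": "business-critical", "rto_minutes": 60,  "rpo_minutes": 60},
    "intranet-wiki": {"tier": "non-critical",      "rto_minutes": 480, "rpo_minutes": 1440},
}

# Recover the tightest RTOs first.
recovery_order = sorted(apps, key=lambda name: apps[name]["rto_minutes"])
print(recovery_order)
# prints: ['erp-database', 'email', 'intranet-wiki']
```

Laying the targets out like this also makes the cost conversation concrete: the tighter the tier, the more the underlying replication will cost.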

We could spend countless hours reviewing VMware Site Recovery Manager and vSphere Replication, but the truth is that there is plenty of information available online, and you can determine for yourself that it is overly complicated and requires a ton of time to deploy, which most of us just do not have. You can read the 151-page installation guide. Scary, right?
https://docs.vmware.com/en/Site-Recovery-Manager/8.2/srm-install-config-8-2.pdf

So, how do you leverage the power of recovery plans? Is there a better way? Yes, there is. Fast forward to Nutanix AOS and Nutanix Leap, designed with simplicity in mind! Let Nutanix solve this problem for you. Nutanix Leap helps you set up, configure, orchestrate, and automate all of your disaster recovery services from our centralized management tool, Prism Central. Nutanix Leap includes support for both ESXi and AHV! Now you can easily organize and implement your failover based on your RPO and RTO in a few simple steps.
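To make the RPO-driven configuration concrete, here is a minimal sketch of what a protection policy boils down to. The field names here are illustrative only, not the actual Prism Central API schema; in practice you build this in the Prism Central UI, or from the documented Nutanix REST API:

```python
# Sketch only: the field names below are illustrative, not the exact Prism
# Central schema. They capture the decisions a Leap protection policy asks
# you to make: how often to snapshot, and how many copies to keep where.
def protection_policy(name: str, rpo_minutes: int,
                      local_retention: int, remote_retention: int) -> dict:
    """Describe a simple snapshot-and-replicate policy."""
    return {
        "name": name,
        "rpo_minutes": rpo_minutes,              # snapshot every N minutes
        "local_snapshots_kept": local_retention,
        "remote_snapshots_kept": remote_retention,
    }

policy = protection_policy("tier1-apps", rpo_minutes=60,
                           local_retention=24, remote_retention=24)
print(policy["rpo_minutes"])
# prints: 60
```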

In a separate blog, we will cover extending these same capabilities to the public cloud, using our DRaaS offering, Xi Leap!

Let’s take a look at how we do it. For this, I am going to borrow some content from a friend of mine, Ahmed Badawy. So, if you like the content of this article, and are looking for guided step-by-step instructions to configure Nutanix Leap, please visit his blog site, CloudPlus360. Oh, and be sure to give him props!

Simple Steps to Deploying Leap on Nutanix

The first step is to deploy Prism Central. You will need to implement Prism Central at the primary site and also at the secondary, or recovery, site. (It is important to note that in a future release, you will have the option to deploy Prism Central as a single instance, out-of-band.)

Deploying Prism Central

Next, we are going to enable Leap and configure (connect) the Availability Zones. But before we begin, please make sure you have connectivity between the primary and secondary sites! This is often the most complex part of the deployment.

Enabling Leap
Connect to Availability Zone
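Since connectivity between sites is so often the sticking point, a quick sanity check before you connect the Availability Zones can save a lot of head-scratching. A minimal sketch, with hypothetical hostnames (Prism listens on TCP port 9440):

```python
import socket

def can_reach(host: str, port: int = 9440, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refusal, or timeout
        return False

# Hypothetical addresses: substitute your own Prism Central instances.
for site in ("pc-primary.example.com", "pc-recovery.example.com"):
    print(site, "reachable" if can_reach(site) else "UNREACHABLE")
```

Run it from each side; both directions need to work before the Availability Zones will pair cleanly.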

Once we enable Leap and configure the Availability Zones, we will need to create the Protection Policy, again from Prism Central.

Creating a Protection Policy
Configuring a Protection Policy

And last, we need to create and configure the Recovery Plan.

Configuring a Recovery Plan
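Pulling the steps above together, the whole deployment is a short ordered checklist, where each step depends on the ones before it:

```python
# The Leap deployment steps from this post, in order.
LEAP_STEPS = [
    "Deploy Prism Central at the primary site",
    "Deploy Prism Central at the recovery site",
    "Enable Leap on both Prism Central instances",
    "Connect the Availability Zones",
    "Create a Protection Policy",
    "Create and configure a Recovery Plan",
]

def next_step(completed: int) -> str:
    """Given how many steps are done, report what comes next."""
    if completed < len(LEAP_STEPS):
        return LEAP_STEPS[completed]
    return "Done: test your failover!"

print(next_step(4))
# prints: Create a Protection Policy
```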

That is all there is to it. It is that simple. And this is what it looks like, conceptually.

Let’s close this out.

I spent a lot of time managing Tier 1 enterprise applications. Throughout my career, there was not one moment when I did not worry about a data center outage. Failing over from one data center to another was my ultimate fear. It was not until I stepped away from the complexities of other solutions and took a closer look at Nutanix that I learned how simple designing, planning, and testing a complete disaster recovery solution could be. And when the time comes, failover can be just as easy. You can sleep peacefully again!

Simply recover[ed]!
