DRP: what is it and how to technically implement it?

Written by Emma Dixneuf | 12-Jan-2023 13:06:12

What is a DRP and what are the considerations to have in mind before defining it?

DRP stands for Disaster Recovery Plan. It is a set of technical and organizational processes that ensures the recovery of information technology services after a disaster such as a fire or a hardware failure. It is a specific part of a Business Continuity Plan, which covers keeping business operations running in general.

Before defining the plan in detail, there are some considerations that can be useful to determine the scope and needs of a DRP.

Scope and priorities of the DRP

To help define the scope and priorities of a DRP, the following questions can be answered:

What is the impact on the business if my application/IT system is down?
Which services are critical for business, and which are only nice to have?
Is the entire scope of service necessary for business continuity, or can a degraded mode be sufficient?

RTO and RPO

RTO means Recovery Time Objective, which is the maximum time during which services can be down without causing significant damage to the business.
RPO means Recovery Point Objective, which is the maximum amount of data that can be lost due to a disaster without causing significant harm to the business. It is usually expressed as a period of time (for example, "at most 15 minutes of data").

Discussing these considerations will be a great help to define a DRP, as they always have to be kept in mind when deciding between different solutions, with different implementation and usage costs.
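To make these two objectives concrete, here is a minimal sketch that compares a candidate recovery setup against an RTO and an RPO. The class, the function name, and all the figures are hypothetical and only serve as an illustration; the one assumption built into the logic is that, with periodic backups, the worst case is losing one full backup interval of data.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Objectives:
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable window of lost data


def meets_objectives(objectives, estimated_recovery, backup_interval):
    """Compare a candidate recovery setup against the RTO and RPO.

    With periodic backups, the worst case is a disaster just before the
    next backup, so up to one full interval of data can be lost.
    """
    return {
        "rto_ok": estimated_recovery <= objectives.rto,
        "rpo_ok": backup_interval <= objectives.rpo,
    }


# Hypothetical figures: 4 h RTO, 15 min RPO, nightly backups, 3 h rebuild.
targets = Objectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
result = meets_objectives(
    targets,
    estimated_recovery=timedelta(hours=3),
    backup_interval=timedelta(hours=24),
)
print(result)  # {'rto_ok': True, 'rpo_ok': False}
```

In this made-up case the rebuild time fits the RTO, but nightly backups cannot satisfy a 15-minute RPO, which points toward some form of continuous replication for the data.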

An overview of possible technical solutions for a DRP

There are different types of infrastructure configurations that can be identified for DRP. Here are the top three to keep in mind:

From Scratch

The infrastructure is rebuilt from scratch in case of disaster, using infrastructure as code, pipelines, scripts, or manual configuration. This is the least expensive solution to run, as you do not have to pay for additional instances. However, you have to consider the cost of the time needed to rebuild the infrastructure.

Sometimes, depending on the components rebuilt, the delay can also be quite long (for example, a database instance might take several hours to be provisioned).

Active / passive

Some of the infrastructure is replicated but does not receive traffic, or is shut down where possible. In case of disaster, the passive environment is made active, and the remaining configuration is applied (manually or by script). This reduces the deployment time but increases the cost, as more machines are deployed.

Active / active

The infrastructure is replicated in real time and serves traffic from more than one region. This ensures failover and also reduces latency for users in different regions. Manual or automatic failover operations are significantly reduced. However, the cost is multiplied, as several copies of the infrastructure run at the same time.

Depending on the availability needs of the components (which should be listed during the RTO/RPO definition), you can use one of these solutions for your DRP. Sometimes a solution is not available for the component you target, or it would require too much rework of your infrastructure to be viable.

Considering this can also help during the decision process.
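As a rough illustration of that decision process, the sketch below picks the cheapest of the three configurations whose recovery time fits a given RTO. The recovery times are hypothetical placeholders, not measurements, and the ordering by cost simply follows the descriptions above; a real decision would also account for RPO, rework effort, and what each component supports.

```python
from datetime import timedelta

# Ordered from cheapest to most expensive to run, as described above.
# The recovery times are hypothetical placeholders; measure your own.
STRATEGIES = [
    ("from scratch", timedelta(hours=3)),
    ("active/passive", timedelta(minutes=30)),
    ("active/active", timedelta(minutes=1)),
]


def cheapest_strategy(rto):
    """Return the first (cheapest) strategy whose recovery time fits the RTO."""
    for name, recovery_time in STRATEGIES:
        if recovery_time <= rto:
            return name
    return None  # none of the listed strategies is fast enough


print(cheapest_strategy(timedelta(hours=4)))     # from scratch
print(cheapest_strategy(timedelta(hours=1)))     # active/passive
print(cheapest_strategy(timedelta(seconds=10)))  # None
```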

An example of infrastructure DRP on an Azure infrastructure

Now, let’s focus on an example Azure infrastructure to sketch a DRP. We will suppose that the scope has already been defined as the application most critical to the business.

Here are the considerations taken into account for this DRP, and the decisions made according to them:

This application can afford some downtime (1 hour is acceptable).
- This implies that it would be all right for the app service to be built “from scratch”, as it is stateless.
The data needs to be consistent; we cannot afford data loss, as this is business critical.
- This implies that the databases need to be replicated continuously; scheduled backups alone are not enough as a recovery plan.

Given these considerations and the Azure services available, here is what could be done:

The app service will be built from scratch in the new region in case of a disaster
The databases are replicated
The storage account and Key Vault are geo-redundant, and Front Door is a global service, so none of them needs to be replicated separately
To ensure traffic to the backend, the API Management component needs to be replicated, as well as the Azure Container Registry, so that the image can be pulled when rebuilding the app service

This is an example that illustrates the different possibilities for a DRP, according to the business needs. Of course, depending on the constraints, it should be adjusted.
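One way to keep such a plan easy to review is to write it down as data. The sketch below is only an illustration: the `Strategy` enum, the `PLAN` mapping, and the helper function are hypothetical names, and the component list simply restates the decisions from the example above.

```python
from enum import Enum


class Strategy(Enum):
    REBUILD = "rebuilt from scratch at failover"
    REPLICATED = "replicated ahead of time"
    BUILT_IN = "geo-redundant or global, nothing to replicate"


# The example plan as data; names mirror the components discussed above.
PLAN = {
    "App Service": Strategy.REBUILD,
    "Databases": Strategy.REPLICATED,
    "API Management": Strategy.REPLICATED,
    "Azure Container Registry": Strategy.REPLICATED,
    "Storage account": Strategy.BUILT_IN,
    "Key Vault": Strategy.BUILT_IN,
    "Front Door": Strategy.BUILT_IN,
}


def components_with(strategy):
    """List the components of the plan that use the given strategy."""
    return [name for name, s in PLAN.items() if s is strategy]


# What has to be built during a disaster, and what must already be in place.
to_rebuild = components_with(Strategy.REBUILD)
must_exist_first = components_with(Strategy.REPLICATED)

print("Rebuild at failover:", to_rebuild)
print("Must already be replicated:", must_exist_first)
```

Listing the plan this way makes the dependency visible: the only rebuild step is the app service, and it relies on the replicated container registry being available to pull the image.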

Conclusion

We have seen an overview of what a DRP is and how it can be implemented on a simple Azure infrastructure. I would like to highlight the following points to keep in mind when thinking about a DRP:

The solution has to match the business needs: if the business app does not need to be 99.99% available, let’s not waste energy on ensuring that.
Before digging deeper into a bulletproof solution that can take a lot of time and effort to implement, ask whether it is worth it (in terms of time, energy, and cost versus the impact of the business being down).
Prioritize the scopes, as not everything can be done at the same time (and that is OK).
When building an infrastructure, try to anticipate the eventual need for a DRP, because some early choices can have a late impact when elaborating the recovery plan.

Building a Disaster Recovery Plan can be long and tedious, so it is important to always stay focused on the objectives and scopes defined beforehand 😉
