Why a disaster recovery strategy?
Ensuring business continuity is a core need of every organization. It is met by having a response ready for any disaster or emergency that may affect information systems. RTO and RPO targets will always vary, mainly according to each company's line of business, so it is important to be clear about the objectives for each of the organization's assets. Based on those objectives, the best backup and disaster recovery strategy can be defined.
Establish objectives
Recovery Point Objective (RPO)
This is the maximum acceptable amount of data loss, measured as the time elapsed since the last recovery point. The organization defines it, and it determines how much data loss between the last recovery point and the service interruption is considered acceptable.
In practice, the data actually lost is whatever changed between the last backup and the moment of the disaster, and the RPO sets the upper limit on that interval. Once the company's RPO is defined, it helps determine the best backup strategy.
For critical systems, an RPO of 15 minutes is recommended for a good compromise between system load and processing time.
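The relationship between backup frequency and RPO can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sample timestamps are hypothetical, and the worst-case data loss is taken to be the largest gap between consecutive recovery points.

```python
from datetime import datetime, timedelta

def worst_case_data_loss(recovery_points, now):
    """Return the largest gap between consecutive recovery points,
    including the gap from the most recent one up to `now`."""
    if not recovery_points:
        raise ValueError("at least one recovery point is required")
    points = sorted(recovery_points) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))

def meets_rpo(recovery_points, now, rpo):
    """True if no gap between recovery points exceeds the RPO."""
    return worst_case_data_loss(recovery_points, now) <= rpo

# Hypothetical recovery points taken every 10 minutes, from 12:00 to 12:30.
start = datetime(2024, 1, 1, 12, 0)
points = [start + timedelta(minutes=10 * i) for i in range(4)]
now = start + timedelta(minutes=35)

print(meets_rpo(points, now, rpo=timedelta(minutes=15)))  # True
print(meets_rpo(points, now, rpo=timedelta(minutes=5)))   # False
```

With a 10-minute backup interval, a 15-minute RPO is met, while a 5-minute RPO would require more frequent recovery points.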
Recovery Time Objective (RTO)
This is the maximum acceptable delay between the interruption and the restoration of the service. This objective determines what is considered an acceptable window of time when the service is not available.

This is the phase in which all systems are recovered and the integrity of the systems and data is verified, so that all critical systems can resume operation. Typical examples are verifying databases and records and confirming that applications and services are running and available.
This is one of the most important stages, since the service owners are responsible for confirming that their services are operating in optimal conditions. A recommended practice is to automate service startup from the moment the server boots, since this shortens validation and recovery and, in turn, reduces downtime.
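Automated validation of this kind could look like the sketch below. It is a minimal example, not a prescribed procedure: the check names are hypothetical stand-ins for real probes (a database query, a call to a health endpoint), and `run_recovery_checks` is an illustrative helper that compares the elapsed outage time against the RTO.

```python
import time
from datetime import timedelta

def run_recovery_checks(checks, outage_started, rto, clock=time.time):
    """Run each named check and report whether all passed within the RTO.
    `checks` maps a label to a zero-argument callable returning True or False."""
    results = {name: bool(check()) for name, check in checks.items()}
    elapsed = timedelta(seconds=clock() - outage_started)
    return {
        "results": results,
        "all_passed": all(results.values()),
        "elapsed": elapsed,
        "within_rto": elapsed <= rto,
    }

# Hypothetical checks; each lambda stands in for a real probe.
checks = {
    "database_reachable": lambda: True,
    "records_present": lambda: True,
    "web_service_responding": lambda: True,
}

outage_started = time.time() - 20 * 60  # pretend the outage began 20 minutes ago
report = run_recovery_checks(checks, outage_started, rto=timedelta(hours=1))
print(report["all_passed"], report["within_rto"])
```

Running the same checks on every recovery exercise gives a repeatable measure of how close the actual recovery time is to the objective.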

Cloud Recovery Strategies
Once the RTO and RPO have been defined, the next step is to determine the best DR strategy. The options are listed below.
Backup & Restore
Pilot Light
Warm standby
Multi-site active/active

Backup & Restore
This strategy is a suitable approach for mitigating data loss or corruption. In AWS, it makes it possible to recover data in a different region if the production region fails.
In this case, the recovery process consists of rebuilding the infrastructure, so it is important to have mechanisms that bring resources up quickly and reduce the margin of error. Examples include EC2 launch templates, which store the configuration of the current service, and Infrastructure as Code (IaC), which launches all the resources at once and reduces the margin for human error.
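As a rough illustration of the IaC idea, the Python sketch below generates a minimal CloudFormation template that declares an EC2 launch template. The logical ID, the template name, the AMI ID, and the instance type are placeholders chosen for this example, not values from a real environment.

```python
import json

def launch_template_stack(ami_id, instance_type):
    """Build a minimal CloudFormation template (as a dict) that declares
    an EC2 launch template. Values are placeholders supplied by the caller."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "DR sketch: launch template for rebuilding an application server",
        "Resources": {
            "AppLaunchTemplate": {  # hypothetical logical ID
                "Type": "AWS::EC2::LaunchTemplate",
                "Properties": {
                    "LaunchTemplateName": "dr-app-server",  # hypothetical name
                    "LaunchTemplateData": {
                        "ImageId": ami_id,
                        "InstanceType": instance_type,
                    },
                },
            }
        },
    }

# Placeholder AMI ID; in a real recovery it would be an AMI copied to the DR region.
template = launch_template_stack("ami-EXAMPLE", "t3.micro")
print(json.dumps(template, indent=2))
```

Keeping a definition like this in version control means the same infrastructure can be recreated in the recovery region without retyping settings by hand.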
Pilot Light
This approach replicates data from on-premises, another cloud, or an AWS region to another AWS region, and provisions a copy of the main workload's infrastructure there. The resources needed to replicate data are always running, while other resources, such as application servers and their configurations, stay switched off until a recovery exercise or a failover during a disaster. This keeps costs to a minimum, since the only active resources are those receiving replicated data.
Warm standby
This DR strategy keeps a scaled-down copy of the production environment running at all times. The environment can respond to requests, but with reduced capacity, so it cannot handle the same traffic as production. Compared with the two previous strategies, however, recovery times are shorter when a disaster occurs: traffic is switched to this environment, and it is only necessary to increase the size of the resources or apply an auto scaling strategy so that the required resources are provisioned automatically according to demand.
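The scale-up step on failover can be estimated with simple arithmetic. The sketch below is an assumption-laden example: the traffic figures, the per-instance throughput, and the 20% headroom are all hypothetical, and in practice an auto scaling policy would make this adjustment automatically.

```python
import math

def instances_needed(expected_rps, rps_per_instance, headroom_pct=20):
    """Instances required to serve `expected_rps` with spare headroom.
    Integer percentages keep the arithmetic exact."""
    return math.ceil(expected_rps * (100 + headroom_pct) / (100 * rps_per_instance))

def scale_up_plan(standby_instances, expected_rps, rps_per_instance):
    """How many instances to add to the warm standby on failover."""
    target = instances_needed(expected_rps, rps_per_instance)
    return {"target": target, "to_add": max(0, target - standby_instances)}

# Hypothetical numbers: 2 standby instances, production traffic of 1000 requests/s,
# and each instance handling about 100 requests/s.
plan = scale_up_plan(standby_instances=2, expected_rps=1000, rps_per_instance=100)
print(plan)
```

The smaller the standby fleet, the lower the running cost, but the more instances have to be added before the environment can absorb full production traffic.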
Multi-site active/active
This DR strategy is very similar to warm standby, since a copy of the production environment must be kept running. The difference is that here the copy must have the same capacity as production. The main recommendation for this strategy is to link the environments through a DNS service, which in AWS is Route 53, so that traffic can be redirected almost immediately to the DR environment. It is important to note that the closer a solution gets to zero recovery time, the more expensive it becomes.
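The trade-off between cost and recovery time across the four strategies can be expressed as a simple lookup. The thresholds in this sketch are illustrative assumptions, not figures from this article; each organization would replace them with recovery times measured in its own recovery exercises.

```python
from datetime import timedelta

# Illustrative thresholds only (assumptions): each entry pairs the fastest
# recovery time the strategy is assumed to achieve with its name, ordered
# from least to most expensive.
STRATEGIES = [
    (timedelta(hours=4), "Backup & Restore"),
    (timedelta(minutes=30), "Pilot Light"),
    (timedelta(minutes=5), "Warm standby"),
    (timedelta(0), "Multi-site active/active"),
]

def cheapest_strategy(rto):
    """Return the least expensive strategy whose assumed recovery time fits the RTO."""
    for achievable, name in STRATEGIES:
        if rto >= achievable:
            return name
    raise ValueError("RTO must be non-negative")

for target in (timedelta(hours=8), timedelta(minutes=45),
               timedelta(minutes=10), timedelta(seconds=30)):
    print(target, "->", cheapest_strategy(target))
```

Because the list is ordered by cost, the first match is the cheapest option that still satisfies the objective, which mirrors the idea that tighter recovery targets demand more expensive strategies.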
Contact us for any questions or comments you may have.
Source: Disaster recovery options in the cloud https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html
Published: 11/4/2024
Author: Daniela Blanco | Solution Architect