It is an interesting point to discuss. I am taking the example of Azure here, but it is applicable to other public clouds as well.
Azure Site Recovery is a great native tool that helps us enable disaster recovery (DR) by replicating VMs to another region with a few clicks. Microsoft allows you to turn the VMs on during a disaster recovery, or whenever you want to, which saves you the running cost of VMs for the DR setup. However, can Microsoft turn on all the VMs in the secondary region if a region fails? How many of you have thought about that scenario?
My concerns around this grew last year during the early COVID-19 period, when utilization peaked to a new height. There were a lot of cases reported in which organizations were unable to create new VMs because Microsoft data centers, including Azure regions, were running out of resources due to a sudden usage spike across the world. What would happen if thousands of customers in a region wanted to start their VMs in their secondary Azure region, resulting in lakhs of VMs starting on the same day?
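To put rough numbers on that worry, here is a back-of-the-envelope sketch. Every figure in it (customer count, VMs per customer, spare capacity in the secondary region) is a made-up assumption for illustration, not a number published by Microsoft.

```python
def dr_shortfall(customers, vms_per_customer, spare_capacity_vms):
    """Return (demand, shortfall) for a region-wide failover.

    All inputs are hypothetical; real regional capacity is not public.
    """
    demand = customers * vms_per_customer
    # Shortfall is how many VMs cannot start once the spare capacity is used up.
    shortfall = max(0, demand - spare_capacity_vms)
    return demand, shortfall


# Assumed scenario: 5,000 customers fail over 60 VMs each into a region
# that has room for 100,000 extra VMs.
demand, shortfall = dr_shortfall(
    customers=5_000, vms_per_customer=60, spare_capacity_vms=100_000
)
print(f"VMs requested: {demand}, VMs that cannot start: {shortfall}")
```

Even with generous assumed headroom, demand in this scenario is three times the spare capacity, which is the gap I am worried about.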
Can we assume that Microsoft reserves that many resources in every region, waiting for DR? No, I do not think so. We probably need a deep dive into how Microsoft performs DR for its regions. Frankly speaking, I have not seen any Microsoft documentation around this in my Google searches. So, what is going to happen in a region failure?
Here are my thoughts. These are my personal opinions and do not represent Microsoft or any other company.
Microsoft will never be doing a full disaster recovery of a region. Really? Yes, I think so. It would not be required unless there is a catastrophic disaster in a region, like an earthquake that destroys all the Azure data centers in the region. That is rarely going to happen, because these considerations would already have been made during data center location selection. With multiple data centers deployed in the same region, up to 10 km or more apart, it is unlikely that an entire region would fail. These independent data centers are called Availability Zones. So I presume that you would experience a zone failure but not an entire region failure. And if it does happen? That is still a million-dollar question.
However, the points above do not negate the need for a DR plan as part of the business continuity plan (BCP), because we do not know what is really going to happen tomorrow. But we must be careful in designing the DR plan. My suggestion is to focus more on the availability features in Azure that cater for data center failures, using availability sets and Availability Zones. These Azure data centers (zones) do not share networking, grid power supply, cooling systems, etc. Microsoft is adding Availability Zone support to most of its services.
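As a rough illustration of why spreading across zones helps, the sketch below places VMs round-robin over three zones and checks what survives when one zone fails. The VM names and the zone labels "1", "2", "3" are assumptions for the example; it does not call any Azure API.

```python
from itertools import cycle


def spread_across_zones(vm_names, zones=("1", "2", "3")):
    """Assign each VM to a zone in round-robin order (hypothetical placement)."""
    return dict(zip(vm_names, cycle(zones)))


def surviving_vms(placement, failed_zone):
    """Return the VMs that are not in the failed zone."""
    return [vm for vm, zone in placement.items() if zone != failed_zone]


# Six hypothetical web VMs spread over three zones.
vms = [f"web-{i}" for i in range(6)]
placement = spread_across_zones(vms)
still_running = surviving_vms(placement, failed_zone="2")
print(f"{len(still_running)} of {len(vms)} VMs survive: {still_running}")
```

With an even spread over three zones, a single zone failure takes out roughly one third of the VMs, and nothing has to be started in another region.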
I expect Microsoft will come up with a VM resource reservation feature that enterprises can use to prioritize their critical workloads and ensure VM availability in the DR region. This would ensure that Microsoft powers on these VMs with high priority during a disaster recovery. I think it makes sense to pay an additional price for highly critical VMs. This applies to all other public cloud providers as well, including AWS with CloudEndure.
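Here is a minimal sketch of how such priority-based power-on could behave when capacity is short. The `Workload` class, the priority scale, and the capacity figure are all hypothetical; this is my own illustration of the idea, not how Microsoft or any other provider actually schedules VMs.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    vms: int
    priority: int  # hypothetical scale: 1 = most critical


def power_on_order(workloads, capacity):
    """Start workloads in priority order until the DR region runs out of room.

    Returns (started, deferred) lists of workload names.
    """
    started, deferred = [], []
    for w in sorted(workloads, key=lambda w: w.priority):
        if w.vms <= capacity:
            capacity -= w.vms  # consume capacity for this workload
            started.append(w.name)
        else:
            deferred.append(w.name)  # not enough room left for this one
    return started, deferred


# Assumed scenario: 120 VMs requested, but only 90 available in the DR region.
workloads = [
    Workload("reporting", vms=40, priority=3),
    Workload("payments", vms=30, priority=1),
    Workload("web", vms=50, priority=2),
]
started, deferred = power_on_order(workloads, capacity=90)
print("started:", started, "deferred:", deferred)
```

In this scenario the critical workloads come up first and only the low-priority one waits, which is what paying extra for a reservation would buy you.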
Like I always say, we should transform our mindset and think the cloud way when we start moving our workloads to the cloud. We do not need a like-for-like design that matches our data center in the public cloud.
Please send me your comments on this. I am happy to hear your thoughts.