One of the biggest challenges customers face is making sure a highly available solution survives a catastrophic failure at the fabric layer of Microsoft Azure: you know, things like servers, storage, network devices, power and cooling. Not having to care about the fabric layer is one of the main reasons why organizations consider running their workloads in Azure in the first place.
However, Azure locations are not magic castles that are invulnerable to catastrophic failures or natural disasters. Of course, the magnitude of the disaster determines which scenarios organizations can plan for to safeguard (more or less) the availability of their workloads. After all, Microsoft and its customers share the responsibility for keeping the lot running.
Within a single region, there are two options for maintaining high availability:
- Availability Sets: allow workloads to be spread over multiple hosts and racks, but they still remain in the same data center;
- Availability Zones: allow workloads to be spread over multiple locations within the region, so you automatically don’t have to care which host the workload runs on.
The following picture displays how the two options differ in the failures they protect against and in SLA percentage. Obviously, Availability Zones offer higher protection against failures. Region pairs are beyond the scope of this post…
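To make the difference in failure coverage concrete, here is a minimal toy model in Python. It is only a sketch: the `Placement` class, the `survives` function and all host, rack and data center names are hypothetical labels made up for illustration, not anything Azure exposes. It simply checks whether at least one VM sits outside the component that failed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one VM lives. All values are illustrative labels only."""
    datacenter: str
    rack: str
    host: str


def survives(placements, failed_scope, failed_value):
    """Return True if at least one VM is outside the failed component.

    failed_scope is the name of a Placement field ("host", "rack" or
    "datacenter"); failed_value is the label of the component that failed.
    """
    return any(getattr(p, failed_scope) != failed_value for p in placements)


# Availability Set: different hosts and racks, but the same data center.
availability_set = [
    Placement("dc1", "dc1-rack1", "dc1-rack1-host1"),
    Placement("dc1", "dc1-rack2", "dc1-rack2-host1"),
]

# Availability Zones: each VM lands in a different data center.
availability_zone = [
    Placement("dc1", "dc1-rack1", "dc1-rack1-host1"),
    Placement("dc2", "dc2-rack1", "dc2-rack1-host1"),
]

if __name__ == "__main__":
    for name, layout in [("set", availability_set), ("zone", availability_zone)]:
        for scope, value in [("rack", "dc1-rack1"), ("datacenter", "dc1")]:
            print(name, scope, value, "survives:", survives(layout, scope, value))
```

In this model both layouts ride out a rack failure, but only the zone layout still has a node left when the whole data center goes down.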
The beauty of both scenarios is that the VNet required to connect an Azure VM is not bound to a single data center, a.k.a. an Availability Zone. It is stretched over the whole region.
So I thought, let’s try this out with a typical workload that requires a high level of availability and can sustain failure pretty well. My choice was to host an SQL fail-over cluster (no Always On Availability Group) with additional resiliency using Storage Spaces Direct. Using all these techniques to maintain uptime, how cool is that?
I used the following guides to deploy a two-node Windows Server 2016 cluster:
- https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/deploy-storage-spaces-direct (ignore the Hyper-V and network related stuff)
- https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sql/virtual-machines-windows-portal-sql-create-failover-cluster (this guide covers the Availability Set variant only; ignore the Availability Set part when using an Availability Zone)
Actually, I built two SQL S2D clusters. Both clusters were identical (two DS11 VMs, each with two P30 disks), except that one was configured with an Availability Set and the other with Availability Zones.
What makes the difference is the requirement for the Azure Load Balancer. You need an Azure Load Balancer in front of the cluster, because its probe determines which node is active. Looking at the Azure Load Balancer overview, available at https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-overview you can see that you need the Standard SKU when using Availability Zones. When using an Availability Set, the Basic SKU is sufficient. But that’s actually the only difference when deploying an SQL cluster using S2D. However, since the Load Balancer is an internal one, I’d recommend using the Standard SKU in both cases. From a pricing perspective, I don’t believe it would make much of a difference, and if the penalties for downtime are severe, I wouldn’t nitpick about it anyway.
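The SKU rule from this paragraph is small enough to write down as a check you could drop into a deployment script. The Python sketch below assumes nothing beyond what is stated above; `required_lb_sku` and `recommended_lb_sku` are hypothetical helper names, not part of any Azure SDK, and the rule reflects the Load Balancer documentation as I read it when writing this post.

```python
def required_lb_sku(availability_option: str) -> str:
    """Minimum Load Balancer SKU for a given availability option.

    "zone" -> Standard is required.
    "set"  -> Basic is sufficient.
    """
    option = availability_option.strip().lower()
    if option == "zone":
        return "Standard"
    if option == "set":
        return "Basic"
    raise ValueError(f"unknown availability option: {availability_option!r}")


def recommended_lb_sku(availability_option: str) -> str:
    """My own recommendation: Standard in both cases."""
    required_lb_sku(availability_option)  # only validates the input
    return "Standard"


if __name__ == "__main__":
    for option in ("set", "zone"):
        print(option, "requires", required_lb_sku(option),
              "and I would pick", recommended_lb_sku(option))
```

Keeping the minimum and the recommendation apart makes it obvious where the documented requirement ends and my own preference begins.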