Aug 19, 2022

Azure VM resiliency and storage redundancy

Resiliency vs Redundancy

Both resiliency and redundancy are critical for highly available design and deployment of IT infrastructure. Despite the critical nature of both, resiliency and redundancy are not the same thing:

Redundancy is the duplication of critical system components in order to increase a system’s reliability.

Resilience is the ability of a system to recover from failures and continue to function.

This means that achieving resiliency in a system may require redundant components.
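To make the link between redundancy and resiliency concrete, here is a minimal sketch of how redundant components raise overall availability. It assumes the components fail independently of each other, which real systems only approximate, and the function name and the 99% figure are illustrative rather than Azure values.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of a group of redundant components.

    Assumes independent failures: the group is down only when
    every replica is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


# Illustrative figure only: one component that is up 99% of the time.
for n in (1, 2, 3):
    print(f"{n} replica(s): {combined_availability(0.99, n):.6f}")
```

Each extra replica shrinks the chance that everything is down at once, which is why the options below add redundancy at the VM and storage level.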

VM Resiliency

In a distributed system, failures can happen. Hardware can malfunction. The network can have transient failures. Rarely, an entire service or region may experience a disruption. It is important to plan for any failures through continuous monitoring and regular tests.

To help customers with their resiliency strategy, Azure offers three options for VM Resiliency:

  1. Single VM deployment
  2. Availability Sets
  3. Availability Zones

Single VM deployment

Starting with the first and simplest option: even with no VM redundancy, Microsoft Azure provides some resiliency for your VMs through storage redundancy, which is discussed in greater detail later in this article under Storage Redundancy.

Availability Sets

With Availability Sets, the VMs we put in an availability set are placed in different Fault Domains and Update Domains. For example, when designing a resiliency strategy for a highly available 3-tier application, we would place the front-end servers in an Availability Set so that those VMs are assigned to different Fault Domains and Update Domains.

But what are Fault Domains and Update Domains?

A Fault Domain defines a group of virtual machines that share a common power source and network switch.

An Update Domain indicates a group of virtual machines and underlying physical hardware that can be rebooted at the same time.

This ensures that if one VM is down because of a hardware issue or update maintenance, the other(s) can still respond to requests, which results in an SLA of 99.95%.
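The effect of Fault Domains and Update Domains can be sketched with a small model. This is a simplification and not Azure's placement logic: the round-robin rule, the domain counts (2 fault domains, 5 update domains), and the names `place_vms` and `survivors` are all assumptions made for illustration.

```python
def place_vms(vm_names, fault_domains=2, update_domains=5):
    """Assign each VM a (fault domain, update domain) pair in round-robin order.

    Hypothetical model: the counts and the round-robin rule are assumptions.
    """
    return {
        name: (index % fault_domains, index % update_domains)
        for index, name in enumerate(vm_names)
    }


def survivors(placement, failed_fd=None, rebooting_ud=None):
    """Return the VMs that stay up when one fault domain fails
    or one update domain is rebooted."""
    return [
        name
        for name, (fd, ud) in placement.items()
        if fd != failed_fd and ud != rebooting_ud
    ]


placement = place_vms(["web-0", "web-1", "web-2"])
print(placement)
print("Fault domain 0 fails:", survivors(placement, failed_fd=0))
print("Update domain 1 reboots:", survivors(placement, rebooting_ud=1))
```

In this model, at least one front-end VM keeps answering requests whichever single domain is affected, which is the property the availability set is meant to provide.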

Availability Zones

For greater resilience and availability, the solution is Availability Zones.

First, we need to understand the Azure architecture. Azure is divided into Regions, and each Region contains Availability Zones interconnected by diverse fiber-optic paths. Each zone is composed of one or more datacenters, each equipped with independent power, cooling, and networking.

This means that when we deploy VMs into availability zones, they are placed in different zones within the region, which provides resiliency against a full zone outage and translates into the 99.99% SLA.
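To put the 99.95% and 99.99% SLA figures side by side, the following sketch converts an SLA percentage into the downtime it still allows. The 30-day month is an assumption made for the arithmetic, and the function name is illustrative.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime per period that still meet the given SLA.

    The 30-day default is an assumption used only for this illustration.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


for label, sla in (("Availability Sets", 99.95), ("Availability Zones", 99.99)):
    print(f"{label}: {sla}% -> {allowed_downtime_minutes(sla):.2f} minutes per 30 days")
```

Under these assumptions, 99.95% allows roughly 21.6 minutes of downtime in a 30-day month, while 99.99% allows roughly 4.3 minutes.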

With availability zones, you can design and operate applications and databases that automatically transition between zones without interruption. Azure availability zones are highly available, fault tolerant, and more scalable than traditional single or multiple datacenter infrastructures.

Summary of Resiliency options:

VM Resiliency options

Storage Redundancy

Now that VM resiliency is covered, let’s look at storage redundancy.

Microsoft Azure offers a range of storage redundancy options, which means that Azure stores multiple copies of your data, depending on the option selected:

  • LRS: Locally Redundant Storage: 3 synchronous copies of data in the same physical location in the primary region
  • ZRS: Zone-Redundant Storage: 3 synchronous copies of data across three different availability zones in the primary region
  • GRS: Geo-Redundant Storage: 3 synchronous copies of data in the same physical location in the primary region, plus an asynchronous copy to a secondary region, where the data is again kept as 3 synchronous copies using LRS
  • GZRS: Geo-Zone-Redundant Storage: 3 synchronous copies of data across three different availability zones in the primary region, plus an asynchronous copy to a secondary region, where the data is again kept as 3 synchronous copies using LRS

There are also read-access options for the secondary region: RA-GRS and RA-GZRS.
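The options above can be summarized as a small lookup table. This is only a sketch that restates the copy counts from the list; the class and field names are made up for illustration and are not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyOption:
    """Illustrative record of one storage redundancy option (names are hypothetical)."""
    primary_copies: int
    zone_spread: bool            # primary copies spread across availability zones
    secondary_copies: int = 0    # copies kept with LRS in the secondary region
    secondary_readable: bool = False

    @property
    def total_copies(self) -> int:
        return self.primary_copies + self.secondary_copies


OPTIONS = {
    "LRS": RedundancyOption(3, zone_spread=False),
    "ZRS": RedundancyOption(3, zone_spread=True),
    "GRS": RedundancyOption(3, zone_spread=False, secondary_copies=3),
    "GZRS": RedundancyOption(3, zone_spread=True, secondary_copies=3),
    "RA-GRS": RedundancyOption(3, False, 3, secondary_readable=True),
    "RA-GZRS": RedundancyOption(3, True, 3, secondary_readable=True),
}

for name, option in OPTIONS.items():
    print(
        f"{name:8} total copies={option.total_copies} "
        f"zone spread={option.zone_spread} "
        f"readable secondary={option.secondary_readable}"
    )
```

Laid out this way, the geo options hold six copies in total, three in each region, and the RA variants differ only in allowing reads from the secondary region.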

Microsoft recommends (and defaults to) Managed Disks, which integrate with Availability Sets and Availability Zones and provide an industry-leading ZERO% annualized failure rate. With Managed Disks, we can only consider LRS and ZRS, as GRS and GZRS are not supported.

If we have a single VM without Availability Sets or Availability Zones, with managed disks on ZRS, and one of the replicas goes down, the service will automatically recover our VM from another replica. However, if the zone that goes down is the one where the single VM is deployed, we will have downtime. In that case, we can recover from a snapshot and build another VM or, in the future, force the deletion of the VM and re-create it.

Check the following table for a quick comparison of the single VM scenario:

Single VM and Storage Redundancy LRS vs ZRS

So, a deployment with Availability Zones also gives us VM redundancy: even if a full zone is down, the service can keep running from the remaining VM(s) in the other zone(s).
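The difference between the single VM scenario and a zonal deployment can be sketched with a toy model. It looks only at where the VMs run and ignores disks, load balancing, and recovery time; the VM names, zone numbers, and function name are hypothetical.

```python
from typing import Mapping


def service_available(vm_zones: Mapping[str, int], failed_zone: int) -> bool:
    """True if at least one VM runs outside the failed zone.

    Toy model: considers VM placement only, not storage or networking.
    """
    return any(zone != failed_zone for zone in vm_zones.values())


single_vm = {"vm-0": 1}                        # one VM, pinned to zone 1
zonal = {"vm-0": 1, "vm-1": 2, "vm-2": 3}      # one VM per zone

for failed in (1, 2, 3):
    print(
        f"zone {failed} down: "
        f"single VM up={service_available(single_vm, failed)}, "
        f"zonal deployment up={service_available(zonal, failed)}"
    )
```

In this model the single VM is unavailable whenever its own zone fails, while the zonal deployment stays up for any single-zone outage.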

Reliability for Azure VMs and Storage

Useful References

Resiliency in Azure | Microsoft Docs

Resilience in Azure Whitepaper | Microsoft Azure

Azure Resiliency Infographic_Final (microsoft.com)

Azure Disk Storage overview – Azure Virtual Machines | Microsoft Docs

Understanding Azure storage redundancy offerings – Microsoft Tech Community

Data redundancy – Azure Storage | Microsoft Docs

Storage Accounts and reliability – Microsoft Azure Well-Architected Framework | Microsoft Docs

SLA for Storage Accounts | Microsoft Azure

Redundancy options for Azure managed disks – Azure Virtual Machines | Microsoft Docs

Zone Redundant Storage (ZRS) option for Azure Disks for high availability – YouTube

Deploy a ZRS managed disk – Azure Virtual Machines | Microsoft Docs

Availability options for Azure Virtual Machines – Azure Virtual Machines | Microsoft Docs

High availability and disaster recovery for IaaS apps – Azure Architecture Center | Microsoft Docs

Architecting for resiliency and availability – Azure Architecture Center | Microsoft Docs

SLA for Virtual Machines | Microsoft Azure

VMs Resiliency checklist for services – Azure Architecture Center | Microsoft Docs

Storage Resiliency checklist for services – Azure Architecture Center | Microsoft Docs

John Savill’s Technical Training – Microsoft Azure Master Class Part 4 – Resiliency – YouTube

Get Started

Reach out to your PSAM or the Technical Presales & Deployment team at https://aka.ms/TPDNewRequest

Author

  • Sr. Partner Technical Consultant - Security, Compliance & Identity

    Microsoft Partners are key for reaching our customers. I've been helping those partners to achieve more in the security & identity topics and respond to the ever-growing security challenges through cloud advisory consultations, technical presales, solution demonstrations, PoCs, design and architecture planning. As part of my continuous self-development I try to take on new challenges and keep updated on my area of expertise: Azure, Cloud Identity, Security & Compliance.