<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" >

<channel><title><![CDATA[AZURE HEROES - Blog]]></title><link><![CDATA[https://www.azure-heros.com/blog]]></link><description><![CDATA[Blog]]></description><pubDate>Sun, 26 Apr 2026 04:23:33 +0300</pubDate><generator>Weebly</generator><item><title><![CDATA[Azure Private Link Gets Direct Connect]]></title><link><![CDATA[https://www.azure-heros.com/blog/azure-private-link-gets-direct-connect]]></link><comments><![CDATA[https://www.azure-heros.com/blog/azure-private-link-gets-direct-connect#comments]]></comments><pubDate>Tue, 14 Apr 2026 21:00:00 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.azure-heros.com/blog/azure-private-link-gets-direct-connect</guid><description><![CDATA[Azure Direct Connect for Private Link Service&nbsp;removes the mandatory Standard Load Balancer from the Private Link path entirely. Services can now be exposed directly from a backend Network Interface (NIC). That's a bigger deal than the feature announcement makes it sound. Why the Load Balancer Was There in the First Place: Private Link Service was designed to project a private service from one virtual network into another: across tenant boundaries, across subscriptions, even across overlappi [...] ]]></description><content:encoded><![CDATA[<div class="paragraph"><strong style="color:rgb(26, 26, 26)">Azure Direct Connect for Private Link Service</strong><span style="color:rgb(26, 26, 26); font-weight:400">&nbsp;removes the mandatory Standard Load Balancer from the Private Link path entirely. Services can now be exposed directly from a backend Network Interface (NIC). 
That's a bigger deal than the feature announcement makes it sound.</span><br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.azure-heros.com/uploads/8/5/6/6/8566957/aznetditect_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div><!--BLOG_SUMMARY_END--></div><div><div id="921223598388365434" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><div class="container"><h2>Why the Load Balancer Was There in the First Place</h2><p>Private Link Service was designed to project a private service from one virtual network into another: across tenant boundaries, across subscriptions, even across overlapping address spaces. The Standard Load Balancer acted as the mandatory frontend: Private Link needed a stable entry point to perform traffic steering and Source NAT (SNAT).</p><p>SNAT is the key mechanism that makes Private Link work across overlapping networks. When two networks share the same IP range (say, both use <code>10.0.0.0/16</code>), Azure rewrites the source address so traffic can flow without routing conflicts. The load balancer was the anchor for that process.</p><p>The problem is that for many workloads (especially smaller services, legacy apps, or SaaS platforms exposing individual tenant endpoints) the load balancer added real cost and complexity without providing meaningful value. You were paying for a component whose only job was to make Private Link technically possible.</p><h2>What Direct Connect Actually Changes</h2><p>With Direct Connect enabled, Azure's Software Defined Networking (SDN) layer injects traffic directly into the backend NIC. The NAT still happens. That's still how Azure solves the overlapping address problem. 
But it now happens at the NIC level rather than requiring an intermediate load balancer.</p><p>One thing worth understanding clearly: Private Link performs <strong>destination-side NAT</strong> on the provider side. The NAT IP address is what your backend service actually sees as the source of incoming packets &mdash; not the consumer's real IP. If your application needs the original consumer IP (for logging, rate limiting, or auditing), you can enable <strong>TCP Proxy v2</strong> on the Private Link Service. That adds a proxy protocol header carrying both the consumer's source IP and the Private Endpoint's link identifier, which your service can parse. Just make sure your application is configured to handle that header before enabling it &mdash; mismatched configuration causes request failures.</p><div class="arch-compare"><div class="arch-box old"><h4>&#10060; Traditional Private Link Path</h4><div class="arch-flow"><div class="node">Consumer VNet</div><div class="arrow">&darr;</div><div class="node">Private Endpoint</div><div class="arrow">&darr;</div><div class="node highlight">Standard Load Balancer</div><div class="arrow">&darr;</div><div class="node">Backend NIC</div><div class="arrow">&darr;</div><div class="node">Service</div></div></div><div class="arch-box new"><h4>&#9989; Direct Connect Path</h4><div class="arch-flow"><div class="node">Consumer VNet</div><div class="arrow">&darr;</div><div class="node">Private Endpoint</div><div class="arrow">&darr;</div><div class="node node-green">Backend NIC (direct)</div><div class="arrow">&darr;</div><div class="node">Service</div></div></div></div><div class="callout"><strong>Key specs</strong> Direct Connect supports up to <strong>10 Gbps throughput</strong> per static IP configuration and requires at least two static IP configurations (in multiples of two) for high availability. 
You can assign up to <strong>8 NAT IP addresses</strong> per Private Link Service &mdash; each additional NAT IP expands the port pool available for TCP connections, which is how you scale throughput under heavy load.</div><h2>Where This Actually Matters: Two Real Scenarios</h2><div class="scenario-card"><h3>Scenario 1: The Overlapping Network Problem</h3><p>This is the one that comes up constantly in enterprise work: two networks that need to communicate privately, but both use the same IP address space. VNet peering won't work. You can't route between them normally. The traditional workarounds (NAT appliances, VPN hairpins, renumbering one side) are all painful.</p><p>Private Link was already the cleanest answer to this problem because of SNAT. Direct Connect makes it even cleaner by removing the load balancer from the equation:</p><div class="ip-clash"><div class="ip-box">VNet A<br>10.0.0.0/16</div><div class="ip-arrow">&#8644;</div><div class="ip-box">VNet B<br>10.0.0.0/16</div></div><p>Azure rewrites the source address transparently so both sides can communicate without knowing about the overlap. This is especially useful in enterprise M&amp;A scenarios, multi-tenant SaaS platforms, and partner integrations where you have no control over the other side's address space.</p><div class="callout green"><strong>With Direct Connect</strong> Traffic flows from Private Endpoint &rarr; Backend NIC. No load balancer to provision, no frontend IP to manage, no extra cost tier to justify.</div></div><div class="scenario-card"><h3>Scenario 2: Application Gateway Integration</h3><p>Application Gateway can integrate with Private Link Service, which is a useful pattern for exposing web applications privately across tenant boundaries. But historically, this required a Standard Load Balancer sitting in front of the backend, even when Application Gateway itself was already handling traffic distribution.</p><p>Direct Connect removes that requirement. 
With a static private IP destination configured on the Private Link Service, Application Gateway connects directly to the backend resource. The architecture stays private end-to-end, and you've eliminated a component that wasn't adding value in that topology.</p><div class="callout warn"><strong>Note on static IPs</strong> Direct Connect requires a static private IP destination. Dynamic IP assignment is not supported in this mode. Plan your IP allocation before deployment.</div></div><h2>Implementation Checklist</h2><p>Before you deploy Direct Connect with Private Link Service, verify these requirements are met:</p><ul class="checklist"><li><strong>Static IP configuration:</strong> Define a static private destination IP for the service. Dynamic is not supported.</li><li><strong>High availability:</strong> Minimum of two static IP configurations required, in multiples of two.</li><li><strong>Subnet policy:</strong> Disable the <code>privateLinkServiceNetworkPolicies</code> setting on the subnet before deploying.</li><li><strong>Feature registration:</strong> Register the preview feature flag on your subscription before use.</li></ul><p>The feature registration command:</p><div class="code">az feature register --namespace Microsoft.Network --name AllowPrivateLinkserviceUDR</div><p>Verify registration status:</p><div class="code">az feature show --namespace Microsoft.Network --name AllowPrivateLinkserviceUDR --query properties.state</div><h2>Limitations Worth Knowing Before You Deploy</h2><p>The docs are clear on these, and they matter at design time rather than after the fact:</p><ul class="checklist"><li><strong>IPv4 only.</strong> IPv6 is not supported on Private Link Service, with or without Direct Connect.</li><li><strong>TCP and UDP only.</strong> Other IP protocols are not supported.</li><li><strong>NIC-based backend pools only.</strong> If your Standard Load Balancer backend pool is configured by IP address rather than by NIC, Private Link Service will not 
work. Direct Connect also targets the NIC directly, so this constraint carries over.</li><li><strong>5-minute idle timeout.</strong> Private Link Service drops idle connections after approximately 300 seconds. Any application connecting through it should use TCP keepalives set below that threshold to avoid unexpected disconnects.</li><li><strong>No Basic Load Balancer support.</strong> Standard Load Balancer is required for the traditional path. Direct Connect bypasses the load balancer entirely, but your overall setup still needs to meet Standard tier requirements.</li></ul><div class="callout warn"><strong>Heads up on TCP Proxy v2</strong> If you enable TCP Proxy v2 on a Private Link Service, it activates across all load balancers and backend VMs sharing that configuration. If multiple Private Link Services share the same load balancer or backend pool, all of them need to be configured consistently &mdash; otherwise health probes will fail.</div><h2>My Take</h2><p>This is the kind of change that doesn't make headlines but quietly makes life better for a lot of architects. The Standard Load Balancer requirement for Private Link was never a design philosophy. It was an implementation constraint. Removing it is the right move.</p><p>The two scenarios where I'd reach for this immediately: overlapping-IP environments where renumbering isn't an option, and multi-tenant SaaS platforms where per-tenant load balancers were adding up on the monthly bill. For both of those, Direct Connect is a genuine architectural simplification, not just a cost tweak.</p><p>It's still in public preview, so I wouldn't run critical production workloads through it just yet. 
But it's absolutely worth spinning up a lab, understanding the static IP requirements, and getting ahead of the feature before it reaches GA.</p><div class="sources"><strong>References</strong><br><a href="https://learn.microsoft.com/en-us/azure/private-link/private-link-service-overview" target="_blank" rel="noopener noreferrer">Azure Private Link Service - Microsoft Docs</a><br></div></div></div></div>]]></content:encoded></item><item><title><![CDATA[Azure Finally Has Native DNS Query Logging]]></title><link><![CDATA[https://www.azure-heros.com/blog/azure-finally-has-native-dns-query-logging]]></link><comments><![CDATA[https://www.azure-heros.com/blog/azure-finally-has-native-dns-query-logging#comments]]></comments><pubDate>Wed, 18 Mar 2026 21:00:00 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.azure-heros.com/blog/azure-finally-has-native-dns-query-logging</guid><description><![CDATA[Azure DNS Security Policies&nbsp;hit general availability, bringing native DNS query logging to the platform. And while logging alone would have been enough to get my attention, they also shipped DNS query filtering (block, allow, alert) in the same package. Let me walk through how it all works. Why This Took So Long to Matter: The core of Azure's built-in DNS resolution is the wire server at 168.63.129.16. Every virtual machine in Azure can use it by default to resolve Azure Private DNS zones and p [...] ]]></description><content:encoded><![CDATA[<div class="paragraph"><strong style="color:rgb(26, 26, 26)">Azure DNS Security Policies</strong><span style="color:rgb(26, 26, 26); font-weight:400"><span>&nbsp;</span>hit general availability, bringing native DNS query logging to the platform. And while logging alone would have been enough to get my attention, they also shipped DNS query filtering (block, allow, alert) in the same package. 
Let me walk through how it all works.</span><br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.azure-heros.com/uploads/8/5/6/6/8566957/dnssecuritypo_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div><!--BLOG_SUMMARY_END--></div><div><div id="584665749312294297" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><div class="container"><h2>Why This Took So Long to Matter</h2><p>The core of Azure's built-in DNS resolution is the wire server at <code>168.63.129.16</code>. Every virtual machine in Azure can use it by default to resolve Azure Private DNS zones and public names. It's reliable, it's fast, and for years it produced exactly zero DNS query logs. Nothing. That magic IP address had no logging capability at all.</p><p>Even the Azure Private DNS Resolver, when it was introduced, didn't solve this. It improved DNS architecture significantly but still left the logging gap wide open.</p><p>So customers with compliance requirements, security teams that needed DNS visibility, or architects who just wanted to know what their workloads were actually resolving ended up doing one of two things:</p><ul class="checklist"><li><span class="icon">1.</span><span><strong>Stand up a third-party DNS service</strong> (Infoblox, BlueCat, BIND, Windows DNS Server) and route all compute through it. This gave you logs but also gave you more VMs to manage, more licensing costs, and an architecture that didn't fit well for isolated workloads.</span></li><li><span class="icon">2.</span><span><strong>Use Azure Firewall's DNS proxy feature</strong>, which did include query logging. 
A reasonable option if you were already running Firewall, but a significant cost to justify just for DNS visibility.</span></li></ul><p>Both patterns were expensive ways to solve what should have been a platform-level feature. DNS Security Policies finally makes it one.</p><h2>What DNS Security Policies Actually Do</h2><p>There are two functions here, and they're worth separating clearly:</p><div class="component-grid"><div class="component-card"><h4>DNS Query Logging</h4><p>Every query processed through the policy gets captured: source IP, query name, record type, action taken. Send it to Log Analytics, a storage account, or Event Hub.</p></div><div class="component-card"><h4>DNS Query Filtering</h4><p>Block or allow queries based on domain lists. Blocked queries return a specific CNAME response instead of NXDOMAIN, making it clear what happened.</p></div><div class="component-card"><h4>Alert Action</h4><p>There's an "alert" action in the rule set. Honest note: in testing, it appears to behave the same as allow and log. Worth being aware of before you build logic around it.</p></div></div><h2>How the Resources Fit Together</h2><p>Before getting into configuration, it helps to understand how DNS Security Policies are structured. There are three resource types involved, and the Microsoft Learn docs use slightly different names from what the API actually calls them. Here's the quick translation:</p><div class="callout"><strong>Naming cheat sheet (Learn docs vs. API)</strong> DNS Security Policy = DNS Resolver Policy<br>DNS Traffic Rules = DNS Security Rules<br>Domain Lists = DNS Resolver Domain Lists</div><p>These three types relate to each other as follows: the DNS Security Policy is the parent. It has two types of children: <strong>DNS Traffic Rules</strong> (your filtering and logging logic) and <strong>Virtual Network Links</strong> (which VNets the policy applies to). 
Domain Lists are sibling resources to the policy itself, not children, which means you can reuse the same domain list across multiple policies.</p><h3>DNS Traffic Rules</h3><p>Rules are the engine of the policy. Each rule has a priority (100 to 65,000), an action (block, allow, or alert), and a reference to a domain list. A policy can have up to 10 rules. Rules are evaluated in priority order, lowest number first.</p><p>Here's what a typical layered rule set might look like:</p><div class="priority-chain"><div class="priority-row"><span class="priority-num">100</span> <span><strong>Malware domain list</strong></span> <span class="action-block">Block</span> <span class="priority-hint">Denies queries matching known bad domains</span></div><div class="priority-row"><span class="priority-num">200</span> <span><strong>Sensitive data exfiltration list</strong></span> <span class="action-alert">Alert</span> <span class="priority-hint">Logs the query (currently behaves like allow)</span></div><div class="priority-row"><span class="priority-num">65000</span> <span><strong>All traffic</strong></span> <span class="action-allow">Allow</span> <span class="priority-hint">Catches everything else, generates log entry</span></div></div><div class="callout warn"><strong>Priority ordering gotcha</strong> Rules are evaluated in order, and the first match wins. If you allow <code>contoso.com</code> at rule 100 and try to block <code>bad.contoso.com</code> at rule 200, the block will never fire. The rule 100 allow matches first because <code>bad.contoso.com</code> is a subdomain of <code>contoso.com</code>. Plan your rule ordering carefully.</div><h3>Domain Lists</h3><p>Domain lists are where you define what the rules match against. Each entry is either a full domain name, which also matches its subdomains, or a wildcard, represented by a single period (<code>.</code>), which matches every domain. 
Because domain lists are siblings of the policy rather than children, you can maintain a central set of lists and reuse them across multiple policies in different environments.</p><h3>Virtual Network Links</h3><p>This is how you apply a policy to traffic. Each virtual network link ties one DNS Security Policy to one virtual network. The policy then intercepts queries that flow through that VNet's wire server.</p><p>Key constraints to know: each virtual network can only be linked to <strong>one</strong> DNS Security Policy. However, a single policy can be linked to multiple virtual networks, making centralized designs workable. And since these policies are regional resources, you'll need one per region in a multi-region setup.</p><h2>What Actually Gets Captured</h2><p>The policy processes queries that go through the wire server in the linked virtual network. Here's what was tested and what the results were:</p><table class="scenario-table"><thead><tr><th>Scenario</th><th>Captured?</th></tr></thead><tbody><tr><td>VM using wire server (168.63.129.16) in linked VNet</td><td><span class="tag yes">Yes</span></td></tr><tr><td>VM using Azure Private DNS Resolver in the same VNet</td><td><span class="tag yes">Yes</span></td></tr><tr><td>VM using a DNS proxy in front of Private DNS Resolver</td><td><span class="tag yes">Yes</span></td></tr><tr><td>A records and PTR records</td><td><span class="tag yes">Yes</span></td></tr><tr><td>AAAA records</td><td><span class="tag yes">Yes</span></td></tr><tr><td>TCP-based DNS queries (not just UDP)</td><td><span class="tag yes">Yes</span></td></tr><tr><td>Azure Bastion</td><td><span class="tag yes">Yes</span></td></tr><tr><td>Azure Firewall</td><td><span class="tag yes">Yes</span></td></tr><tr><td>VM pointing to an external DNS server (bypassing wire server)</td><td><span class="tag no">No</span></td></tr></tbody></table><p>That last row is important. 
If a machine is configured to use a DNS server outside the wire server (say, a private IP pointing to an on-prem resolver), its queries won't pass through the policy. The policy only catches what goes through Azure's built-in DNS resolution path.</p><h2>What the Logs and Blocks Actually Look Like</h2><p>When diagnostic logging is enabled on a DNS Security Policy, query events are written to a Log Analytics table named <code>DNSQueryLogs</code>. Each entry includes the source IP of the query, the domain name queried, the record type, and the action taken. The action field values are <code>Deny</code>, <code>Allow</code>, and <code>None</code> (which corresponds to the alert action).</p><p>When a query is blocked, the response the client receives is not an NXDOMAIN. Instead, it gets back a CNAME pointing to <code>blockpolicy.azuredns.invalid</code>. This is actually better behavior for troubleshooting. An NXDOMAIN looks like the domain doesn't exist. The <code>blockpolicy.azuredns.invalid</code> CNAME makes it immediately obvious that a DNS Security Policy is the reason the query failed, not a misconfigured DNS record or a missing zone.</p><h2>Where to Link Policies in Practice</h2><p>The policy intercepts queries at the wire server level in the linked virtual network. In a centralized DNS design where you're running an Azure Private DNS Resolver (or any DNS service) in a hub VNet, the right place to attach the policy is on that hub VNet. All spoke queries that route through the hub's DNS infrastructure will be captured by the policy.</p><p>For isolated workloads that don't connect to a shared DNS hub, you can link the policy directly to each isolated VNet. Since a policy can be linked to multiple VNets, one policy can cover an entire hub-and-spoke design while a separate policy handles isolated environments.</p><div class="callout green"><strong>One policy per region</strong> DNS Security Policies are regional resources. 
If you have workloads in multiple Azure regions, you'll need a separate policy in each region. You can reuse the same domain lists across all of them.</div><h2>My Take</h2><p>This is a feature that should have shipped years ago, and saying that isn't a criticism of the team that built it. DNS query visibility is table stakes for any security-conscious Azure deployment. The fact that it required either a third-party DNS server or Azure Firewall to achieve was a real cost and complexity burden on customers.</p><p>The filtering capability is a genuine bonus. Being able to block known-bad domains at the DNS level, before a connection is even attempted, is a lightweight and effective control. The domain list reuse model is well thought out for multi-environment deployments. The alert action being effectively the same as allow is a quirk to watch, but it's early days and GA means the foundations are solid.</p><p>If you're running Azure workloads with any compliance, security, or operational visibility requirements around DNS, this is worth setting up today. 
The days of spinning up BIND servers just to get DNS logs are over.</p><div class="sources"><strong>References</strong><br><a href="https://learn.microsoft.com/en-us/azure/dns/dns-security-policy" target="_blank" rel="noopener noreferrer">Azure DNS Security Policies - Microsoft Docs</a><br><a href="https://journeyofthegeek.com/2025/08/03/dns-in-microsoft-azure-dns-security-policies/" target="_blank" rel="noopener noreferrer">DNS in Microsoft Azure: DNS Security Policies - Journey of the Geek</a></div></div></div></div>]]></content:encoded></item><item><title><![CDATA[Azure Virtual Network Routing Appliance — A Native Solution for Hub-and-Spoke Routing]]></title><link><![CDATA[https://www.azure-heros.com/blog/azure-virtual-network-routing-appliance-a-native-solution-for-hub-and-spoke-routing]]></link><comments><![CDATA[https://www.azure-heros.com/blog/azure-virtual-network-routing-appliance-a-native-solution-for-hub-and-spoke-routing#comments]]></comments><pubDate>Sat, 14 Feb 2026 21:00:00 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.azure-heros.com/blog/azure-virtual-network-routing-appliance-a-native-solution-for-hub-and-spoke-routing</guid><description><![CDATA[A few months back, a customer called me with a familiar frustration. They had a solid hub-and-spoke topology in Azure — around 100 spoke VNets, one per application team, clean isolation, good governance. Textbook setup. The problem? Traffic between their spokes had to pass through an Azure Firewall Premium they'd deployed in the hub, and they were starting to hit the 100 Gbps ceiling. On top of that, their monthly Azure Firewall bill had grown to a point where the finance team was asking quest [...] ]]></description><content:encoded><![CDATA[<div class="paragraph"><font color="#2A2A2A">A few months back, a customer called me with a familiar frustration. 
They had a solid hub-and-spoke topology in Azure &mdash; around 100 spoke VNets, one per application team, clean isolation, good governance. Textbook setup. The problem? Traffic between their spokes had to pass through an Azure Firewall Premium they'd deployed in the hub, and they were starting to hit the 100 Gbps ceiling. On top of that, their monthly Azure Firewall bill had grown to a point where the finance team was asking questions &mdash; and the honest answer was: "We're mostly using it as a router, not a firewall."</font><br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.azure-heros.com/uploads/8/5/6/6/8566957/virtual-network-appliance-diagram_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div><!--BLOG_SUMMARY_END--></div><div class="paragraph"><font color="#2A2A2A">That's an awkward conversation.<br><br>They didn't need deep packet inspection between internal spokes. They didn't need L7 filtering. They just needed traffic from Spoke A to reach Spoke B reliably and fast at scale. We looked at third-party NVAs, but the licensing costs, the VM-based throughput caps, and the 250,000 active connection limit made that feel like trading one problem for another.</font><br><br>Then Microsoft released<span>&nbsp;</span><strong>Azure Virtual Network Routing Appliance</strong><span>&nbsp;</span>into public preview in February 2026, and things got interesting.</div><div><div id="600769528200071820" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><article><h2>Why Spoke-to-Spoke Routing Is a Pain in the First Place</h2><p>If you've worked with Azure long enough, you know that VNet peering is not transitive. 
Spoke A and Spoke B are both peered to the hub, but they can't talk to each other through the hub unless you explicitly tell them how.</p><p>This trips up a lot of engineers coming from traditional networking. On-premises, your router just&hellip; routes. In Azure, the "default gateway" of a subnet doesn't really exist as a routing entity. The virtual NIC itself holds the routing table and sends traffic directly to the destination &mdash; bypassing any gateway in between. So if you want Spoke A to reach Spoke B via the hub, you need an actual device sitting in the hub that Azure can use as a next-hop.</p><p>Historically, your options were:</p><table><thead><tr><th>Option</th><th>The Catch</th></tr></thead><tbody><tr><td><strong>Azure Firewall Premium</strong></td><td>Great firewall, awkward router. Max 100 Gbps, $1.75/hr before data charges, limited BGP support for learning on-prem default routes.</td></tr><tr><td><strong>Azure Firewall Standard</strong></td><td>Max 30 Gbps, $1.25/hr &mdash; even less suited for high-throughput routing.</td></tr><tr><td><strong>Third-party NVA</strong></td><td>VM-based = VM limits. The 250,000 active connection cap has caught more than a few people off guard at scale. Add licensing, patching, and support contracts on top.</td></tr></tbody></table><p>For organizations going cloud-first, there's often a genuine push to avoid third-party NVAs where a native Azure construct can do the job. Until now, there wasn't one.</p><hr><h2>Enter the Azure Virtual Network Routing Appliance</h2><p>The Azure Virtual Network Routing Appliance (AVNA) is a managed Azure resource that you deploy directly inside your hub VNet. It runs on <strong>specialized networking hardware</strong> &mdash; not regular VMs &mdash; which is exactly why the performance numbers look so different from what you're used to.</p><p>Here's what makes it different from an NVA or Azure Firewall:</p><ul><li>It's a <strong>top-level Azure resource</strong> &mdash; managed just like a VNet, NSG, or Route Table. 
No OS to patch, no images to manage.</li><li>It lives in a <strong>dedicated subnet</strong> called <code>VirtualNetworkApplianceSubnet</code>.</li><li>It's <strong>purely a forwarding layer</strong> &mdash; it routes traffic, full stop. No firewall policy, no DPI, no NAT (though it works alongside NAT Gateway).</li><li><strong>High availability is built-in</strong> and it's availability zone resilient by default &mdash; you don't need a load balancer in front of it. If you put one there anyway, it won't work the way you expect.</li><li>Supports <strong>NSGs, Admin rules, UDRs, and NAT Gateway</strong> natively.</li></ul><hr><h2>The Architecture</h2><p>Here's how this looks in a typical hub-and-spoke setup. Each spoke VNet is peered to the hub. You configure UDRs in your spoke subnets to point east-west traffic at the AVNA's IP address as the next-hop. The AVNA handles the rest &mdash; forwarding the packet to the right destination spoke. On-premises traffic keeps going through your hub VPN/ExpressRoute gateway, and internet egress still goes through your NAT Gateway or firewall.</p></article></div></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.azure-heros.com/uploads/8/5/6/6/8566957/imagensg4_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div><div id="193008405506446380" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><p class="diagram-caption">Fig 1 &mdash; Hub-and-spoke topology with AVNA as the east-west routing layer</p><hr><h2>How You Set Up the Routing</h2><p>This is where it gets a bit architectural. Once AVNA is deployed, you need to decide how your spoke UDRs are structured. 
Three options:</p><div class="options-grid"><div class="option-card"><div class="label">Option 1 &mdash; Keep it granular</div><p>Create specific routes per spoke &mdash; cloud prefixes go to AVNA, on-premises traffic to the hub gateway, internet to egress. Maximum control, maximum route table management. Fine for small environments, painful at scale.</p></div><div class="option-card"><div class="label">Option 2 &mdash; Everything through AVNA</div><p>Point a default route (0.0.0.0/0) in the spokes to the AVNA and let it sort everything &mdash; spoke-to-spoke internally, on-prem to the gateway, internet egress onward. Simplest UDR config. Reduces asymmetric routing risk. Trade-off: you lose per-spoke granularity.</p></div><div class="option-card recommended"><div class="label">&#9989; Option 3 &mdash; RFC1918 via AVNA, default to egress (the sweet spot)</div><p>Send all private address space (10/8, 172.16/12, 192.168/16) to the AVNA. Let the default route (0.0.0.0/0) point to your internet egress solution. Cleanly separates spoke-to-spoke routing from internet egress, reduces accidental asymmetric firewall paths, and keeps spoke UDRs simple. 
Most practitioners land here.</p></div></div><hr><h2>Performance &mdash; This Is Where It Gets Serious</h2></div></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.azure-heros.com/uploads/8/5/6/6/8566957/imagensgee4_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div><div id="462504448881156895" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><p class="diagram-caption">Fig 2 &mdash; AVNA bandwidth tiers and their capacity limits</p><p>Compare 200 Gbps and 8 million concurrent flows to what Azure Firewall Premium offers (100 Gbps) or what a typical NVA VM can do &mdash; limited by the VM SKU and that 250,000 connection cap. For high-traffic, east-west-heavy environments, this changes the conversation completely.</p><div class="callout"><strong>&#9888;&#65039; Heads-up:</strong> The capacity tier is selected at deployment time and <strong>cannot be changed later</strong> without a full redeployment. Size it properly upfront. During preview there's no charge, so you might as well pick 200 Gbps &mdash; but when billing kicks in at GA, you'll want to have done that math first.</div><hr><h2>What About NSGs?</h2><p>You can attach an NSG to the <code>VirtualNetworkApplianceSubnet</code> to enforce basic Layer 4 filtering between spokes. But be aware &mdash; because traffic is both entering and leaving through the same subnet, your rules need to handle both inbound and outbound directions. NSGs are a blunt tool here, and this setup is really suited to spokes in the <strong>same security zone</strong> where you just need some least-privilege access control. 
If you need stateful deep filtering, Azure Firewall or a third-party NVA is still the right answer for that layer.</p><hr><h2>Preview Limitations &mdash; Be Honest With Your Stakeholders</h2><table><thead><tr><th>What's limited</th><th>Details</th></tr></thead><tbody><tr><td><strong>Not production-ready</strong></td><td>Preview only &mdash; for testing and evaluation</td></tr><tr><td><strong>Instances per subscription</strong></td><td>Max 2 (request more via form)</td></tr><tr><td><strong>Max throughput</strong></td><td>200 Gbps per instance</td></tr><tr><td><strong>IPv6</strong></td><td>Not supported</td></tr><tr><td><strong>Metrics and logs</strong></td><td>Not yet exposed &mdash; you're flying blind during preview</td></tr><tr><td><strong>Tooling</strong></td><td>No Azure CLI, PowerShell, or Terraform support yet</td></tr><tr><td><strong>Private Endpoint</strong></td><td>Global and cross-region Private Endpoint not supported</td></tr><tr><td><strong>Regions</strong></td><td>East US, East US 2, West Central US, West US, North Europe, UK South, West Europe, East Asia</td></tr><tr><td><strong>Cost</strong></td><td>Free during preview</td></tr></tbody></table><p>The lack of metrics is the one that stings the most in practice. You can't see traffic volumes, connection counts, or error rates. That's fine for a lab, not great if you're trying to build confidence for a GA migration plan.</p><hr><h2>How to Get Access</h2><ol><li>Register your subscription for the preview feature flag: <code>Microsoft.network/AllowVirtualNetworkAppliance</code></li><li>Fill out the sign-up form linked on the <a href="https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-routing-appliance-overview" target="_blank" rel="noopener noreferrer">Microsoft Learn page</a></li><li>Wait for product group approval</li></ol><p>It's technically a public preview, but the approval gate and limited region availability make it feel closer to a private preview. 
Expect some wait time.</p><hr><h2>My Take</h2><p>Going back to my customer &mdash; we're watching this closely. The numbers look right, the architecture fits, and the idea of removing a $1.75/hr firewall that was doing routing work and replacing it with a native Azure construct is exactly the kind of simplification their platform team has been pushing for.</p><p>What I'd want to see before recommending this for production:</p><ul><li><strong>Metrics and logging</strong> &mdash; you need visibility into what's flowing through it</li><li><strong>Terraform and CLI support</strong> &mdash; nobody wants to manage infrastructure only through the portal</li><li><strong>Clear GA pricing</strong> &mdash; the capacity tiers strongly suggest tiered billing; that math needs to make sense vs. alternatives</li><li><strong>IPv6 support</strong> &mdash; more and more environments are running dual-stack</li></ul><p>If you're in a hub-and-spoke topology and your routing solution feels like a workaround, this is worth signing up for and testing. 
Get familiar with it now so you're ready when it hits GA.</p><div class="sources"><strong>Sources:</strong><br><a href="https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-routing-appliance-overview" target="_blank" rel="noopener noreferrer">Microsoft Learn &mdash; Virtual Network Routing Appliance Overview</a><br><a href="https://www.simonpainter.com/azure-virtual-network-appliance/" target="_blank" rel="noopener noreferrer">Simon Painter (Azure MVP) &mdash; Public preview of Azure Virtual Network Routing Appliance</a></div></div></div>]]></content:encoded></item><item><title><![CDATA[Stop Waiting for Incidents: A Practical Guide to Failure Mode Analysis on Azure]]></title><link><![CDATA[https://www.azure-heros.com/blog/stop-waiting-for-incidents-a-practical-guide-to-failure-mode-analysis-on-azure]]></link><comments><![CDATA[https://www.azure-heros.com/blog/stop-waiting-for-incidents-a-practical-guide-to-failure-mode-analysis-on-azure#comments]]></comments><pubDate>Tue, 03 Jun 2025 21:00:00 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.azure-heros.com/blog/stop-waiting-for-incidents-a-practical-guide-to-failure-mode-analysis-on-azure</guid><description><![CDATA[Failure Mode Analysis (FMA) is one of the most underused reliability practices in Azure architecture work. It is not about predicting every possible failure. It is about working through "what happens when this breaks?" before it breaks, so the answers are already designed into the architecture rather than improvised during an incident. The Azure Well-Architected Framework treats FMA as a baseline requirement under the Reliability pillar (RE:03), not an advanced topic. The premise is simple: failu [...] ]]></description><content:encoded><![CDATA[<div class="paragraph">Failure Mode Analysis (FMA) is one of the most underused reliability practices in Azure architecture work. It is not about predicting every possible failure. 
It is about working through "what happens when this breaks?" before it breaks, so the answers are already designed into the architecture rather than improvised during an incident.<br><br><span></span></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.azure-heros.com/uploads/8/5/6/6/8566957/failure-mode-analysis_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div><!--BLOG_SUMMARY_END--></div><div><div id="861852271665322875" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><div class="container"><p>The Azure Well-Architected Framework treats FMA as a baseline requirement under the Reliability pillar (RE:03), not an advanced topic. The premise is simple: failures happen regardless of how resilient your design appears. More complex environments are exposed to more failure types. FMA gives you a structured way to map those failure points across your critical flows, understand the blast radius of each one, and make deliberate decisions about mitigation before those decisions get made for you by an outage.</p><p>This post is a detailed walkthrough of how to do it: the five-step process, the eight failure mode categories you need to work through for every component, the dependency decisions that shape your entire architecture, and the artifact you produce at the end.</p><h2>What FMA Actually Is (and What It Is Not)</h2><p>FMA is a structured process for identifying potential failure points in your workload's critical flows and planning specific mitigation strategies for each one. It is not a risk register filled with unlikely scenarios. It is not a checklist you run once before go-live. 
And it is definitely not the same as chaos engineering, though the two work well together.</p><p>The Azure WAF defines a failure as an unexpected event that prevents a component from continuing to function normally. A hardware malfunction causing a network partition is a failure. A misconfigured routing rule that drops 30% of requests is a failure. These are different from errors, which are expected parts of normal operations that the application handles through business logic - input validation failures, transient HTTP 429s, null checks. The distinction matters because failures require architectural decisions; errors require code.</p><div class="callout"><strong>The core premise of FMA</strong> Failures happen regardless of how many layers of resiliency you apply. More complex environments are exposed to more types of failures. FMA does not assume you can prevent all failures. It assumes you need to know, in advance, which failures break which flows, what the blast radius looks like, and what you plan to do about it.</div><p>The output of FMA is a set of documented decisions: which failure modes you have designed mitigations for, which ones you have accepted as low-probability risks not worth the mitigation cost, and which flows remain vulnerable because the mitigation was too expensive or complex to justify at launch. Making those decisions deliberately is the point.</p><h2>The Five-Step FMA Process</h2><div class="step-flow"><div class="step"><div class="step-num">1</div><div class="step-body"><strong>Identify and prioritize your critical flows</strong><p>FMA is flow-centric, not component-centric. Before touching any component, you need a clear map of the user flows and system flows your workload supports, ranked by criticality. A user sign-in flow in a SaaS product is typically more critical than the monthly invoice generation flow. The criticality of the flow determines how much investment the mitigation deserves. 
This step is assumed to exist before FMA begins - if you have not done flow mapping, start there.</p></div></div><div class="step"><div class="step-num">2</div><div class="step-body"><strong>Decompose the workload into component types</strong><p>For each flow, identify the discrete components it touches. Typically these fall into: ingress control, networking, compute, data and storage, supporting services (authentication, messaging, key management), and egress control. At the design stage you may not know exact services yet - that is fine. The goal is to produce a component map that each flow can be traced through step by step.</p></div></div><div class="step"><div class="step-num">3</div><div class="step-body"><strong>Identify and classify dependencies</strong><p>Once you have a component map, identify every dependency each component has - both internal (within your workload scope, like an internal API or Azure Key Vault) and external (outside your scope, like Microsoft Entra ID or Azure ExpressRoute). For each dependency, capture its reliability data: availability SLA, scaling limits, and whether it has documented failover behavior. This is the step that surfaces hidden single points of failure.</p></div></div><div class="step"><div class="step-num">4</div><div class="step-body"><strong>Evaluate failure modes and blast radius for each component</strong><p>Working through each flow step by step, evaluate how each component and its dependencies could be affected by each class of failure. Document what breaks, what degrades, and what continues working when each failure mode hits. Critically: analyze read failures and write failures separately. A database that can still accept reads during a storage issue has a different blast radius than one that is fully offline. 
The same component can be affected by multiple failure modes simultaneously.</p></div></div><div class="step"><div class="step-num">5</div><div class="step-body"><strong>Plan mitigation and design detection</strong><p>For each failure mode you have identified and chosen to address, define your mitigation strategy: more resiliency (redundancy, zone distribution, regional failover) or graceful degradation (rerouting flows, disabling non-critical features, serving cached responses). Then define how you detect the failure: which metric breaches the threshold, which alert fires, and what the on-call process looks like. Mitigation without detection is a plan that never executes.</p></div></div></div><h2>The Failure Mode Catalog</h2><p>The WAF identifies eight failure mode categories that every component should be evaluated against. Understanding the blast radius profile of each one helps you prioritize which mitigations are worth the cost.</p><div class="failure-grid"><div class="failure-card high"><h4>Regional Outage</h4><p>An entire Azure region becomes unavailable. Typically requires cross-region architecture (active-active or active-passive) to survive. This is the most expensive mitigation to build and the least likely event. For most workloads, the right call is to document the exposure and accept it unless the SLA demands otherwise.</p></div><div class="failure-card high"><h4>Availability Zone Outage</h4><p>One AZ within a region goes down. Zone-redundant deployment is the standard mitigation across compute, data, and networking. This is lower cost than cross-region and covers the more realistic failure scenario. If your services are not zone-redundant, this row in your FMA document should have a clear mitigation plan or an accepted risk decision.</p></div><div class="failure-card med"><h4>Service Outage</h4><p>One or more Azure services become unavailable. 
The only mitigation is redundancy at the same or alternate tier, or graceful degradation if the service is not on the critical path for all flows. Document the RTO/RPO impact and whether your monitoring catches this before your customers do.</p></div><div class="failure-card med"><h4>DDoS or Malicious Attack</h4><p>Layer 3/4 DDoS is handled by Azure infrastructure. Layer 7 attacks are your responsibility. Azure Front Door with Azure Web Application Firewall handles most of this, but the firewall policy configuration, rate limiting rules, and bot protection settings need to be explicitly validated in your FMA. Do not assume protection exists without verifying the configuration.</p></div><div class="failure-card med"><h4>Misconfiguration</h4><p>One of the more likely and more avoidable failure modes. A routing rule change, an RBAC assignment, a certificate rotation, a firewall rule update - any of these can create an outage. Mitigation is infrastructure-as-code with automated validation, deployment gating, and rollback capability. For externally managed configs, define a process for catching them before they reach production.</p></div><div class="failure-card med"><h4>Operator Error</h4><p>Human mistakes during operations, maintenance, or incident response. Mitigation includes Privileged Identity Management with just-in-time access, RBAC scoped to minimum required permissions, change management processes, and runbooks that prevent common errors. Also consider: what does your runbook tell an on-call engineer to do if they misread the alert?</p></div><div class="failure-card low"><h4>Planned Maintenance Outage</h4><p>Known, scheduled maintenance windows that require downtime. Azure maintenance windows exist for some services. For your own components, the mitigation is blue-green deployments, rolling updates, and zero-downtime release pipelines. 
This failure mode is fully preventable with the right deployment architecture.</p></div><div class="failure-card low"><h4>Component Overload</h4><p>A component hits a scaling limit or resource ceiling under load. Mitigation includes autoscaling configurations, load testing to identify limits before production, circuit breakers in application code, and throttling policies on downstream dependencies. Overload failures are often cascading - one slow component causes timeouts that pile up and overwhelm others.</p></div></div><div class="callout warn"><strong>Analyze read and write failures separately</strong> A data service that can still serve reads during a storage issue has a very different impact profile than one that is fully offline. Some flows need write capability (checkout, form submission, state updates); others only need reads (search, dashboards, content display). Breaking them apart in your analysis often reveals that the blast radius is smaller than you assumed - or larger.</div><h2>Strong vs. Weak Dependencies: A Decision That Shapes Your Entire Architecture</h2><p>Once you have cataloged your dependencies, you need to classify each one as either strong or weak. This is not just taxonomy - it drives the mitigation budget for that dependency and determines whether its SLA needs to match yours.</p><div class="dep-grid"><div class="dep-card strong"><h3>Strong Dependencies</h3><p>Components that are required for the workload to function at all. If the dependency is absent or degraded, the flow breaks or the workload is unavailable.</p><ul><li>Microsoft Entra ID for authentication flows</li><li>Azure SQL for transaction processing</li><li>Azure Key Vault for secret access at startup</li><li>Internal APIs that every user-facing call passes through</li></ul><p class="dep-implication"><strong>Implication:</strong> The availability and recovery targets of strong dependencies must align with the targets of the workload itself. 
If your workload SLA is 99.9%, a strong dependency at 99.5% is your ceiling, not your floor.</p></div><div class="dep-card weak"><h3>Weak Dependencies</h3><p>Components whose absence degrades specific features but does not break core flows or make the workload unavailable.</p><ul><li>A recommendation engine - products can still be purchased without it</li><li>An analytics event pipeline - the transaction completes; the event is lost</li><li>A third-party enrichment API - the data shows with defaults if it times out</li><li>Non-critical notification systems</li></ul><p class="dep-implication"><strong>Implication:</strong> Weak dependencies should be wrapped with timeouts, circuit breakers, and graceful fallback behavior. Minimize coupling so that a failure in a weak dependency cannot cascade into a strong dependency.</p></div></div><p>The classification of a dependency often changes across flows. Microsoft Entra ID is a strong dependency for the sign-in flow and a weak dependency for anonymous product search. Document it per flow, not per component globally.</p><div class="callout warn"><strong>Watch for accidental strong dependencies</strong> A weak dependency that is not given a timeout becomes a strong dependency under load. If your application synchronously waits indefinitely for a recommendation engine that is down, the recommendation engine is now a strong dependency in practice, regardless of what your design doc says. Thread exhaustion and connection pool depletion are the typical cascade paths. Circuit breakers are not optional for anything that can be slow.</div><h2>The FMA Document: What You Actually Produce</h2><p>The artifact from FMA is a table that captures each component, the failure mode being analyzed, the likelihood of that failure, the effect on each flow, the mitigation in place, and the outage classification. 
This document is living - it starts as theoretical planning and gets refined through chaos testing and real incidents over time.</p><p>The following is an example based on the e-commerce architecture described in the Azure WAF documentation: an application running on Azure App Service with Azure SQL databases, fronted by Azure Front Door, and using Microsoft Entra ID for authentication.</p><table class="fma-table"><thead><tr><th>Component</th><th>Failure Mode</th><th>Likelihood</th><th>Effect and Mitigation</th><th>Outage Scope</th></tr></thead><tbody><tr><td>Microsoft Entra ID</td><td>Service outage</td><td><span class="likelihood-low">Low</span></td><td>Full workload outage for authenticated users. No mitigation other than Microsoft remediation. Document RTO expectation against Entra SLA.</td><td><span class="badge full">Full</span></td></tr><tr><td>Microsoft Entra ID</td><td>Misconfiguration</td><td><span class="likelihood-med">Medium</span></td><td>Users unable to sign in. No downstream data effect. Application catches auth exceptions and surfaces a clear error. Help desk escalation triggers development team review.</td><td><span class="badge partial">External only</span></td></tr><tr><td>Azure Front Door</td><td>Service outage</td><td><span class="likelihood-low">Low</span></td><td>Full outage for external users. No internal bypass. Dependent on Microsoft to remediate. Ensure Azure Service Health alerts are configured to fire on AFD degradation.</td><td><span class="badge partial">External only</span></td></tr><tr><td>Azure Front Door</td><td>Regional outage</td><td><span class="likelihood-low">Very low</span></td><td>Minimal effect. AFD is a global service; traffic routing automatically shifts to non-affected regions. 
No mitigation action required from the workload team.</td><td><span class="badge none">None</span></td></tr><tr><td>Azure Front Door</td><td>DDoS attack (L7)</td><td><span class="likelihood-med">Medium</span></td><td>L3/L4 DDoS managed by Microsoft. L7 attacks mitigated by WAF policy - rate limiting rules, bot protection, and custom rules are configured and tested. Potential for brief degradation under a sophisticated L7 attack if WAF rules are not current.</td><td><span class="badge partial">Potential partial</span></td></tr><tr><td>Azure SQL</td><td>Service outage</td><td><span class="likelihood-low">Low</span></td><td>Full workload outage for all transactional flows. Read-only flows may survive if a read replica is configured. Dependent on Microsoft to remediate.</td><td><span class="badge full">Full</span></td></tr><tr><td>Azure SQL</td><td>Regional outage</td><td><span class="likelihood-low">Very low</span></td><td>Auto-failover group configured to secondary region. Expected brief outage during failover. RTO and RPO to be validated through controlled failover testing. Failover process is automated; manual intervention not required.</td><td><span class="badge partial">Potential full</span></td></tr><tr><td>Azure SQL</td><td>Availability zone outage</td><td><span class="likelihood-low">Low</span></td><td>No effect. Zone-redundant configuration active. Automatic failover within the region. No mitigation action required.</td><td><span class="badge none">None</span></td></tr><tr><td>App Service</td><td>Regional outage</td><td><span class="likelihood-low">Very low</span></td><td>Minimal effect. Azure Front Door routes traffic to instances in non-affected regions. Latency increase for users in the affected region. 
No data loss expected if the SQL failover is completed within RPO window.</td><td><span class="badge none">None</span></td></tr><tr><td>App Service</td><td>Component overload</td><td><span class="likelihood-med">Medium</span></td><td>Autoscale configured with scale-out rules triggered on CPU and request queue depth. Load testing validated that scale-out completes within 3 minutes at 2x peak load. Circuit breakers in application code prevent SQL connection pool exhaustion during overload.</td><td><span class="badge partial">Potential partial</span></td></tr></tbody></table><div class="callout"><strong>Prioritize before you document everything</strong> A complete FMA for a non-trivial workload generates a large table. Before investing in mitigation planning for every row, rank by severity and likelihood. Multi-region outages warrant documenting and accepting as low-probability risk. Misconfiguration and operator error, which are medium-likelihood and fully preventable, deserve more attention than regional outage scenarios in most commercial workloads.</div><h2>Azure Tooling That Supports FMA Work</h2><p>FMA is a design-time practice, but the tooling that makes it useful spans design, testing, and ongoing operations.</p><ul class="tool-list"><li><span class="tool-icon">&#9633;</span> <span><strong>Azure Monitor and Log Analytics.</strong> The foundation for failure detection in production. Every mitigation you design needs a corresponding alert. If you cannot detect the failure mode in your FMA table, your mitigation plan has a gap. Azure Monitor also surfaces the historical data you need to validate the likelihood assessments in your FMA document over time.</span></li><li><span class="tool-icon">&#9633;</span> <span><strong>Application Insights, Container Insights, VM Insights, SQL Insights.</strong> Workload-level observability that goes deeper than infrastructure metrics. 
Application Insights in particular surfaces dependency call failures, slow response times, and exception patterns that are invisible at the infrastructure layer - exactly the signals that confirm or refute your FMA assumptions.</span></li><li><span class="tool-icon">&#9633;</span> <span><strong>Azure Network Watcher (Connection Monitor and Traffic Analytics).</strong> Use Connection Monitor before deployment to validate network connectivity assumptions in your FMA. Traffic Analytics surfaces historical flow data that reveals blocked or anomalous traffic patterns - evidence that a failure mode you documented as theoretical has been occurring in practice.</span></li><li><span class="tool-icon">&#9633;</span> <span><strong>Azure Chaos Studio.</strong> The tool that converts FMA from theoretical planning into validated reality. Chaos Studio lets you inject specific failure conditions - zone outages, network latency, service unavailability - into a controlled environment to verify that your mitigations actually work. Run chaos experiments against the failure modes in your FMA table, starting with the highest-severity, highest-likelihood rows. The gaps between what you planned and what Chaos Studio reveals are the items that need architectural rework.</span></li></ul><blockquote>A solid FMA practice does not just protect your workload. It protects your customers, your reputation, and your business continuity targets.</blockquote><h2>FMA Is Not a One-Time Exercise</h2><p>The most common mistake with FMA is treating it as a gate that gets cleared before go-live and then filed away. The document becomes stale the moment the architecture changes, which in active workloads happens continuously.</p><p>FMA should be revisited whenever a significant architectural change is made - new service added, new region introduced, new external dependency onboarded. 
It should be reviewed after every incident to check whether the failure mode was already in the document (and whether the mitigation held) or whether it was a gap that the document needs to cover. Chaos Studio experiments scheduled on a regular cadence turn the FMA from a plan into a continuously validated commitment.</p><div class="callout green"><strong>Starting point if you have never done FMA before</strong> Pick your single most critical user flow. Map every component it touches. For each component, work through the eight failure modes and ask: what breaks, who notices first, and what is the current mitigation? Document what you find. You will identify at least one single point of failure that was not on anyone's radar. That finding alone justifies the exercise - and it gives you the concrete case for investing in the broader FMA practice.</div><h2>My Take</h2><p>FMA is one of those practices that feels like overhead until the moment it pays off. At that point, it pays off enormously - because the team is not inventing answers during an incident. The blast radius is already understood. The mitigation is already in place. The detection fired before a customer did. That is the difference between a war room and a 15-minute operations call.</p><p>The WAF framing is correct: failures happen regardless of how resilient the system appears. What FMA gives you is the ability to decide in advance which failures you have designed for, which ones you have accepted as tolerable risks, and which flows are allowed to degrade gracefully versus which ones must stay fully operational at all costs. That is not busywork. That is architecture.</p><p>The teams that skip it are the ones running the post-incident meeting where someone says "we should have caught this." 
The teams that do it well skip that meeting entirely.</p><div class="sources"><strong>References</strong><br><a href="https://learn.microsoft.com/en-us/azure/well-architected/reliability/failure-mode-analysis" target="_blank" rel="noopener noreferrer">Architecture strategies for performing failure mode analysis - Azure WAF (Updated January 2026)</a><br><a href="https://learn.microsoft.com/en-us/azure/well-architected/reliability/" target="_blank" rel="noopener noreferrer">Azure Well-Architected Framework: Reliability pillar - Microsoft Learn</a><br><a href="https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-overview" target="_blank" rel="noopener noreferrer">Azure Chaos Studio overview - Microsoft Learn</a><br><a href="https://learn.microsoft.com/en-us/azure/network-watcher/network-watcher-monitoring-overview" target="_blank" rel="noopener noreferrer">Azure Network Watcher overview - Microsoft Learn</a></div></div></div></div>]]></content:encoded></item><item><title><![CDATA[Data Residency Is Not Data Sovereignty: The Real Story Behind Cloud Control in the EU]]></title><link><![CDATA[https://www.azure-heros.com/blog/data-residency-is-not-data-sovereignty-the-real-story-behind-cloud-control-in-the-eu]]></link><comments><![CDATA[https://www.azure-heros.com/blog/data-residency-is-not-data-sovereignty-the-real-story-behind-cloud-control-in-the-eu#comments]]></comments><pubDate>Sun, 11 May 2025 21:00:00 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.azure-heros.com/blog/data-residency-is-not-data-sovereignty-the-real-story-behind-cloud-control-in-the-eu</guid><description><![CDATA[A customer asked me recently whether they needed to worry about the US CLOUD Act if their Azure tenant was based in the Netherlands. They had read that their data stays in Europe, saw the "EU Data Boundary" label in the admin portal, and assumed that settled it. It doesn't. Not completely. 
And the gap between what they assumed and what the law actually says is exactly where digital sovereignty conversations go wrong.This post is about that gap: the difference between where your data physically si [...] ]]></description><content:encoded><![CDATA[<div class="paragraph">A customer asked me recently whether they needed to worry about the US CLOUD Act if their Azure tenant was based in the Netherlands. They had read that their data stays in Europe, saw the "EU Data Boundary" label in the admin portal, and assumed that settled it.<br><span></span>It doesn't. Not completely. And the gap between what they assumed and what the law actually says is exactly where digital sovereignty conversations go wrong.<br><br><span></span></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.azure-heros.com/uploads/8/5/6/6/8566957/sovereignty-ms_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div><!--BLOG_SUMMARY_END--></div><div><div id="935102705939477192" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><div class="container"><p>This post is about that gap: the difference between where your data physically sits and who has the legal authority to access it. 
They are not the same question, and treating them as one is a planning mistake that shows up later as a compliance problem.</p><h2>Two Terms That Sound the Same and Mean Different Things</h2><div class="compare-grid"><div class="compare-card residency"><h3>Data Residency</h3><p class="card-sub">A geographic question</p><ul><li>Where is the data physically stored?</li><li>In which datacenter region does processing happen?</li><li>Can I choose a specific country for my tenant?</li><li>Answered by your provider's regional configuration</li></ul></div><div class="compare-card sovereignty"><h3>Data Sovereignty</h3><p class="card-sub">A legal and jurisdictional question</p><ul><li>Which country's laws govern access to this data?</li><li>Can a foreign government compel access to it?</li><li>Who controls the encryption keys?</li><li>Answered by your provider's corporate structure and applicable law</li></ul></div></div><p>Residency is about geography. Sovereignty is about authority. A provider can offer you full data residency within the EU and still be subject to foreign jurisdiction. Location does not equal control.</p><h2>The Elephant in the Room: The US CLOUD Act</h2><p>The CLOUD Act (Clarifying Lawful Overseas Use of Data Act) is a US federal law passed in 2018. It gives US law enforcement the power to compel American technology companies to hand over data stored abroad, regardless of where it physically sits. The key word is "American companies." If your provider is incorporated in the United States, the CLOUD Act applies to their infrastructure everywhere, including datacenters in Frankfurt, Amsterdam, and Dublin.</p><p>This means that when Microsoft, Google, or Amazon tells you your data stays in the EU, they are telling you something true about geography and something incomplete about jurisdiction. The data does stay in the EU. 
But a US warrant can still reach it.</p><div class="callout red"><strong>What the CLOUD Act actually allows</strong> US law enforcement can issue warrants to US-based providers for data held abroad, without requiring the cooperation of the country where the data is located, without judicial review in that country, and without prior notice to the affected users or European regulators. It bypasses Mutual Legal Assistance Treaties (MLATs) entirely.</div><h2>How This Collides With GDPR</h2><p>Article 48 of the GDPR is specific: court orders from third countries (like the US) are only valid for transferring EU personal data if they are recognized through an international agreement, such as an MLAT. The CLOUD Act does not use MLATs. It is unilateral.</p><p>This creates a genuine legal dilemma for companies using US-based cloud providers:</p><ul class="pillar-list"><li><span class="pillar-icon">&#9878;&#65039;</span><span><strong>Comply with a US warrant.</strong> The provider hands over data. The organization may face GDPR enforcement action for unauthorized data transfer to a third country.</span></li><li><span class="pillar-icon">&#9878;&#65039;</span><span><strong>Refuse the US warrant.</strong> The provider faces legal consequences in the US. In practice, large US providers rarely refuse.</span></li><li><span class="pillar-icon">&#9878;&#65039;</span><span><strong>Use the "quash or modify" clause.</strong> The CLOUD Act allows providers to challenge warrants that conflict with foreign law. But this option is discretionary, complex, and rarely used in practice.</span></li></ul><p>The European Data Protection Board has been clear: service providers subject to EU law cannot legally base data transfers to the US solely on CLOUD Act requests. But GDPR applies to the data subject's rights, not directly to what the US government demands of a US company. 
The gap between those two jurisdictions is where sovereignty actually lives.</p><h2>What Microsoft Has Actually Built</h2><p>To Microsoft's credit, they have not ignored this. Over the last several years they have built real technical and contractual mechanisms to address sovereignty concerns. These are worth understanding carefully, because they are genuine progress on a hard problem, but they are also not a complete answer to the CLOUD Act question.</p><h3>The EU Data Boundary</h3><p>The EU Data Boundary is Microsoft's commitment to store and process Customer Data and personal data for their enterprise online services (Azure, Dynamics 365, Power Platform, and Microsoft 365) within the EU and EFTA countries. The EFTA countries included are Switzerland, Norway, Iceland, and Liechtenstein, in addition to all 27 EU member states.</p><p>As of February 2025, this covers:</p><ul class="pillar-list"><li><span class="pillar-icon">&#9989;</span><span><strong>Customer Data at rest and in processing.</strong> The actual content you store and work with stays within EU/EFTA datacenters.</span></li><li><span class="pillar-icon">&#9989;</span><span><strong>Pseudonymized system-generated logs.</strong> Microsoft requires all personal data in operational logs to be pseudonymized before it leaves the EU Data Boundary. Techniques include encryption, masking, tokenization, and data blurring.</span></li><li><span class="pillar-icon">&#9989;</span><span><strong>Professional Services Data at rest.</strong> Data from Microsoft consulting and support engagements is stored within the boundary.</span></li></ul><div class="callout warn"><strong>Important: the EU Data Boundary has documented exceptions</strong> There are specific circumstances where Customer Data will continue to be transferred outside the EU Data Boundary. Microsoft publishes these exceptions in detail on the EU Data Boundary documentation site. 
Architects should review the specific services they rely on, as coverage is not uniform across all Azure capabilities.</div><h3>Microsoft Cloud for Sovereignty</h3><p>The EU Data Boundary handles residency. Microsoft Cloud for Sovereignty is Microsoft's answer to the broader sovereignty question. It comes in three tiers:</p><div class="tier-grid"><div class="tier-card"><h4>Sovereign Public Cloud</h4><p>Built-in sovereignty controls within the standard Azure public cloud. Customer-managed encryption keys, European personnel approving operational access, tamper-evident access logs, and Azure Policy for governance alignment.</p></div><div class="tier-card"><h4>Sovereign Private Cloud</h4><p>Azure Local deployed on-premises or in a sovereign facility. Workloads operate in a hybrid or fully disconnected environment under local physical control. Designed for classified or highly regulated data.</p></div><div class="tier-card"><h4>National Partner Clouds</h4><p>Microsoft cloud capabilities operated by an independent, nationally licensed partner. The operator is not a US entity, which changes the jurisdictional picture materially. Examples include operator clouds in Germany and France.</p></div></div><p>The Sovereign Landing Zone (SLZ) is the Azure landing zone reference architecture that applies these controls prescriptively, with built-in Azure Policies for sovereignty alignment out of the box. If you are designing a sovereign-compliant environment on Azure, SLZ is the right starting point rather than building governance from scratch.</p><h3>Operational Access Controls</h3><p>One of the more substantive commitments Microsoft has made is around who can access your data for operational purposes. For European Microsoft Cloud services, operational access is approved by European personnel and tracked in tamper-evident logs. 
Customers can also bring and manage their own encryption keys on hardware security modules, which means that even Microsoft cannot decrypt data without explicit customer authorization.</p><p>This is meaningful. If Microsoft cannot technically read your data, a CLOUD Act warrant demanding that data becomes significantly harder to fulfill. "We cannot access it" is a defensible technical position, not just a contractual promise.</p><h2>The Honest Assessment: Marketing vs. Reality</h2><p>Several major US cloud providers now market offerings with "sovereign" in the name. It is worth being specific about what these actually deliver, because the term is used loosely enough that it requires scrutiny.</p><ul class="pillar-list"><li><span class="pillar-icon">&#9633;</span><span><strong>Microsoft 365 EU Data Boundary.</strong> Genuine commitment to data residency within EU/EFTA. Combined with customer-managed keys and the European operational access approvals, this is among the more substantive offerings from a US hyperscaler. The CLOUD Act still applies at the corporate level, but the technical controls reduce practical access.</span></li><li><span class="pillar-icon">&#9633;</span><span><strong>AWS European Sovereign Cloud.</strong> AWS launched a European Sovereign Cloud for the EU, designed to operate independently of AWS commercial operations with data stored in Germany. AWS employees in the EU manage it. This changes the operational picture, but the parent company is still a US entity, which means the CLOUD Act question is not fully resolved.</span></li><li><span class="pillar-icon">&#9633;</span><span><strong>"Sovereign cloud" as a marketing label.</strong> The term appears in many places where the underlying architecture has not materially changed. Data residency in an EU datacenter operated by a US company, without customer-managed encryption or separated operational control, is data residency. 
It is not sovereignty in any meaningful legal sense.</span></li></ul><blockquote>Sovereignty is not a marketing claim, it's a legal reality. If your provider is subject to US jurisdiction, your data may not be safe, even if stored within the EU. <cite>Wire Security Blog, July 2025</cite></blockquote><h2>What This Means for Azure Architects</h2><p>If you are designing solutions on Azure for EU public sector clients, regulated industries, or organizations with explicit sovereignty requirements, the framework has improved significantly but still requires deliberate design choices. Here is how I think about it:</p><ul class="pillar-list"><li><span class="pillar-icon">&#9633;</span><span><strong>Customer-managed keys are not optional for sensitive workloads.</strong> If Microsoft cannot read your data, the CLOUD Act becomes a much weaker lever. Azure Key Vault with customer-managed keys, or Bring Your Own Key (BYOK) stored on HSMs, should be standard for anything touching personal or regulated data. This is not just a compliance checkbox - it is a technical barrier that matters.</span></li><li><span class="pillar-icon">&#9633;&#65039;</span><span><strong>Use the Sovereign Landing Zone if sovereignty is a stated requirement.</strong> Do not design custom governance from scratch. The SLZ gives you Azure Policies pre-configured for sovereignty alignment, region restrictions, and operational controls. Start there and adapt to your specific regulatory context.</span></li><li><span class="pillar-icon">&#9633;</span><span><strong>Understand the EU Data Boundary exceptions for your specific services.</strong> Not every Azure service is covered equally. Before making commitments to a client about where data lives, check the Microsoft EU Data Boundary documentation for the specific services in scope. 
Coverage continues to expand but is not yet uniform.</span></li><li><span class="pillar-icon">&#9633;</span><span><strong>Sovereign Private Cloud for classified or highly regulated scenarios.</strong> If the requirement is "no US entity can even theoretically be compelled to access this data," the public cloud with enhanced controls is not sufficient. Azure Local in a national facility, or a national partner cloud operated by a non-US entity, is the architecture that actually addresses that requirement.</span></li><li><span class="pillar-icon">&#9633;</span><span><strong>Know what MLAT means and when it applies.</strong> When a US legal request arrives, the correct response for EU organizations is to refer the case through the Mutual Legal Assistance Treaty process, which includes EU judicial oversight and GDPR-compatible safeguards. Organizations and their legal counsel should be familiar with this before it becomes urgent.</span></li></ul><div class="callout green"><strong>The good news for Azure architects</strong> The tooling has genuinely improved. Customer-managed keys, European operational access approvals, tamper-evident logs, the EU Data Boundary commitment, and the Sovereign Landing Zone together form a defensible architecture for most regulated EU workloads. The combination of technical controls and contractual commitments is stronger in 2025 than it was in 2020. The CLOUD Act has not gone away, but the practical surface area it can reach has been reduced for organizations that architect deliberately.</div><h2>My Take</h2><p>Digital sovereignty is one of those topics where the terminology creates false confidence. "EU Data Boundary" sounds definitive. "Microsoft Cloud for Sovereignty" sounds like it closes all the gaps. Neither is the full story.</p><p>The CLOUD Act is a real constraint. It has not been legislated away, and the EU-US Data Privacy Framework that replaced Privacy Shield has already faced legal challenges. 
Organizations that assume their provider's marketing language resolves the jurisdiction question are carrying risk they have not accounted for.</p><p>At the same time, writing off Microsoft's sovereign cloud investments as pure marketing underestimates what has actually been built. Customer-managed encryption, separated operational control, and the EU Data Boundary commitment are technical commitments, not just contractual ones. They change what is practically possible even under a legal demand.</p><p>The right framing for architects is this: use the tools that exist, understand what they do and do not cover, and match the architecture to the actual regulatory requirement. For most commercial EU workloads with GDPR obligations, the combination of EU Data Boundary plus customer-managed keys plus the Sovereign Landing Zone gets you to a defensible position. For government or classified data where the standard is "no foreign jurisdiction, period," the answer is a national partner cloud or private deployment where the operator is not a US entity. 
That requirement is different, and the architecture needs to reflect it.</p><div class="sources"><strong>References</strong><br><a href="https://learn.microsoft.com/en-us/privacy/eudb/eu-data-boundary-learn" target="_blank" rel="noopener noreferrer">EU Data Boundary Overview - Microsoft Learn (Updated February 2025)</a><br><a href="https://www.microsoft.com/en-us/industry/sovereignty/cloud" target="_blank" rel="noopener noreferrer">Microsoft Sovereign Cloud - microsoft.com</a><br><a href="https://aws.amazon.com/compliance/europe-digital-sovereignty/" target="_blank" rel="noopener noreferrer">Europe Digital Sovereignty - AWS Compliance</a><br><a href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/sovereign-landing-zone" target="_blank" rel="noopener noreferrer">Sovereign Landing Zone - Azure CAF Documentation</a></div></div></div></div>]]></content:encoded></item><item><title><![CDATA[Azure CAF and WAF Solve Different Problems]]></title><link><![CDATA[https://www.azure-heros.com/blog/azure-caf-and-waf-solve-different-problems]]></link><comments><![CDATA[https://www.azure-heros.com/blog/azure-caf-and-waf-solve-different-problems#comments]]></comments><pubDate>Mon, 28 Apr 2025 21:00:00 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.azure-heros.com/blog/azure-caf-and-waf-solve-different-problems</guid><description><![CDATA[I've had some version of the same conversation across multiple Azure engagements. A team is mid-deployment, things are getting messy, and someone mentions the Well-Architected Framework. Then someone else says "yeah, we did that in the CAF phase." And the room just moves on, carrying a misunderstanding that will surface again later as delivery friction, control gaps, and rework nobody budgeted for. CAF and WAF are not the same thing. They don't cover the same ground. They don't happen at the same [...] 
]]></description><content:encoded><![CDATA[<div class="paragraph"><span style="color:rgb(26, 26, 26); font-weight:400">I've had some version of the same conversation across multiple Azure engagements. A team is mid-deployment, things are getting messy, and someone mentions the Well-Architected Framework. Then someone else says "yeah, we did that in the CAF phase." And the room just moves on, carrying a misunderstanding that will surface again later as delivery friction, control gaps, and rework nobody budgeted for.</span><br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.azure-heros.com/uploads/8/5/6/6/8566957/caf-waf-az_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div><!--BLOG_SUMMARY_END--></div><div><div id="808596867388573800" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><div class="container"><p>CAF and WAF are not the same thing. They don't cover the same ground. They don't happen at the same time. 
And treating them as interchangeable &mdash; or worse, skipping one because you think the other covered it &mdash; is a pattern I've seen cause real problems in real projects.</p><p>This post is my attempt to draw a clean line between the two, explain why the sequence matters, and be honest about where CAF falls short so you can plan around it.</p><h2>What Each Framework Actually Covers</h2><div class="compare-grid"><div class="compare-card caf"><h3>Cloud Adoption Framework (CAF)</h3><p class="card-sub">The platform and the organization</p><ul><li>Cloud strategy and business alignment</li><li>Landing zones and subscription design</li><li>Identity and access governance</li><li>Network topology and connectivity</li><li>Policy, compliance, and cost management</li><li>Operating model and team structure</li><li>Migration planning and execution</li></ul></div><div class="compare-card waf"><h3>Well-Architected Framework (WAF)</h3><p class="card-sub">Individual workloads on that platform</p><ul><li>Reliability (resiliency, recovery targets)</li><li>Security (data protection, threat mitigation)</li><li>Cost Optimization (usage, rate efficiency)</li><li>Operational Excellence (observability, deployments)</li><li>Performance Efficiency (scaling, load testing)</li></ul></div></div><p>The simplest framing I've found: CAF is about getting the cloud foundation right so workloads have somewhere solid to run. WAF is about making sure those workloads are worth running. They operate at different layers, answer different questions, and are owned by different people in most organizations.</p><h2>Why Teams Mix Them Up</h2><p>The confusion usually comes from the fact that both frameworks talk about security, costs, and operations. At a surface level they look like they overlap. But CAF security means governance policies, identity design, and network perimeter. WAF security means threat modeling a specific application, protecting its data, and reviewing its attack surface. 
Same word, completely different scope.</p><p>The other source of confusion is timeline. CAF is most visible at the start of a cloud journey, so teams associate it with "the early stuff." WAF comes up during workload reviews, so it feels like "the later stuff." The problem is when teams treat CAF as a phase you complete and move past, rather than a living foundation you maintain.</p><blockquote>Most teams don't actually confuse CAF vs WAF. They just rush CAF and call it done. Then WAF becomes a patching exercise on top of a weak foundation. <cite>Dheeraj Negi, Senior Azure Platform Architect</cite></blockquote><h2>What Actually Goes Wrong Without a Solid Foundation</h2><p>When CAF is treated as a checkbox rather than a real foundation, the symptoms show up gradually. The first few workloads land fine. But as the number of teams, subscriptions, and services grows, the cracks appear:</p><ul class="pillar-list"><li><span class="pillar-icon">&#9633;</span><span><strong>Governance gaps.</strong> Teams deploy directly to production because policy enforcement was never set up. Cost surprises follow because budgets and tagging weren't defined early.</span></li><li><span class="pillar-icon">&#9633;</span><span><strong>Network dead ends.</strong> Landing zones were designed for the first three workloads and don't scale to the fifteenth. Connectivity to on-premises becomes a retrofit project.</span></li><li><span class="pillar-icon">&#9633;</span><span><strong>Identity debt.</strong> Service principals multiply without lifecycle management. Privileged access is broader than anyone intended. 
Audit trails are incomplete.</span></li><li><span class="pillar-icon">&#9633;</span><span><strong>WAF becomes a band-aid.</strong> Each workload review turns up the same platform-level findings &mdash; logging, access control, network segmentation &mdash; because these were never solved at the foundation level.</span></li></ul><blockquote>It's like optimising furniture arrangement in a house with a cracked foundation. CAF first, WAF always. That's the right order of operations. <cite>Suresh Guntha, Senior Principal Cloud Architect</cite></blockquote><h2>The Sequencing Principle</h2><p>The clearest mental model I've found for this is simple: CAF sets the floor, WAF raises the ceiling. You need both, but you can't skip the floor.</p><div class="sequence-steps"><div class="step"><div class="step-num">1</div><div class="step-body"><strong>Get the foundation right (CAF)</strong> Landing zones, governance policies, identity model, network topology. This doesn't mean perfect &mdash; it means intentional. You're making deliberate decisions about how the platform will operate, not just deploying and hoping for the best.</div></div><div class="step"><div class="step-num">2</div><div class="step-body"><strong>Review workloads against the five pillars (WAF)</strong> Once the foundation is stable, WAF gives each workload a structured lens for quality. Reliability targets, security posture, cost efficiency, operational observability, and performance design &mdash; all against a platform that can actually support them.</div></div><div class="step"><div class="step-num">3</div><div class="step-body"><strong>Treat both as ongoing disciplines, not one-time events</strong> CAF isn't a phase you graduate from. As the organization grows, as new teams onboard, as regulations change, the platform needs to evolve too. 
WAF reviews should be recurring &mdash; at major changes, at scale milestones, before production launches.</div></div></div><div class="callout green"><strong>The ownership question matters</strong> CAF needs a platform team that owns it like a product &mdash; with backlogs, sprints, and accountability. WAF needs workload teams that take the review seriously rather than treating it as a compliance checkbox. Neither works without a clear owner.</div><h2>Where CAF Falls Short (Being Honest)</h2><p>CAF is a genuinely useful framework and it's improved significantly over the years. But it has real gaps that are worth knowing about before you lean on it too heavily.</p><div class="weakness-grid"><div class="weakness-card"><h4>Azure-only scope</h4><p>CAF is built specifically for Azure. If your organization runs workloads across AWS or GCP, CAF won't cover those. You'd need to layer in something like the CNCF Cloud Maturity Model or the relevant vendor's framework alongside it.</p></div><div class="weakness-card"><h4>IaaS and migration bias</h4><p>CAF's most mature guidance is around VM migration, landing zones, and lift-and-shift patterns. Cloud-native workloads, microservices architectures, and PaaS-first designs get lighter treatment. The modernization guidance has improved, but there's still a gap if you're building greenfield cloud-native from the start.</p></div><div class="weakness-card"><h4>Complexity for smaller teams</h4><p>CAF in full scope assumes a team with dedicated cloud architects and governance specialists. For SMBs or smaller engineering teams, the full framework can be genuinely overwhelming and lead to analysis paralysis &mdash; spending more time designing the framework than actually deploying anything.</p></div><div class="weakness-card"><h4>The "Manage" phase gets dropped</h4><p>CAF has a Manage phase covering post-migration operations, monitoring, and ongoing optimization. In practice, it's the phase most often skipped. 
Teams complete the migration, declare success, and then wonder months later why operations are chaotic and costs keep climbing.</p></div></div><div class="callout warn"><strong>On the deprecated Terraform modules</strong> There's been noise about CAF being "deprecated" &mdash; worth clarifying. The CAF Terraform modules (AZTFMOD) were deprecated, not the framework itself. CAF as a strategy, methodology, and set of guidance documents is still actively maintained and evolving. Microsoft's Azure Verified Modules (AVM) is the recommended path forward for IaC implementation.</div><h2>My Take</h2><p>The mistake I see most often isn't confusion between CAF and WAF. It's underestimating what it takes to actually do CAF well. Teams treat landing zone deployment as the finish line when it's really just the starting point. Governance needs to be enforced, not just designed. Identity models need to be maintained, not just drawn on a whiteboard. Network topology needs to scale with the organization, not just with the first workload.</p><p>When the foundation is weak, WAF reviews turn into archaeology expeditions &mdash; digging up problems that should have been solved at the platform level. The same findings come up workload after workload because the root cause is never addressed.</p><p>The right framing is this: CAF is not a project that ends. WAF is not something you do once before go-live. Both are ongoing practices, and both need ownership. 
Get clear on who owns the platform and who owns each workload, make sure both teams have real accountability, and most of the confusion between CAF and WAF tends to sort itself out.</p><div class="sources"><strong>References</strong><br><a href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/" target="_blank" rel="noopener noreferrer">Azure Cloud Adoption Framework - Microsoft Learn</a><br><a href="https://learn.microsoft.com/en-us/azure/well-architected/" target="_blank" rel="noopener noreferrer">Azure Well-Architected Framework - Microsoft Learn</a><br><a href="https://techcommunity.microsoft.com/blog/azurearchitectureblog/why-you-need-a-cloud-adoption-framework-caf-and-probably-a-waf-too/3667426" target="_blank" rel="noopener noreferrer">Why You Need a CAF and Probably a WAF Too - Azure Architecture Blog</a><br><a href="https://www.oneadvanced.com/resources/is-microsoft-caf-still-useful-in-2025/" target="_blank" rel="noopener noreferrer">Is Microsoft CAF Still Useful in 2025? - One Advanced</a></div></div></div></div>]]></content:encoded></item></channel></rss>