Azure outage disrupts VMs and identity services for over 10 hours
Microsoft’s Azure cloud platform suffered a broad outage on Monday evening that disrupted two critical layers of enterprise cloud operations. The outage began at 19:46 UTC on Monday and was resolved by 06:05 UTC on Tuesday, lasting more than 10 hours.
The incident initially left customers unable to deploy or scale virtual machines in multiple regions. This was followed by a related platform issue with the Managed Identities for Azure Resources service in the East US and West US regions between 00:10 UTC and 06:05 UTC on Tuesday. The disruption also briefly affected GitHub Actions.
Policy change at the core of disruption
The outage was caused by a policy change unintentionally applied to a subset of Microsoft-managed storage accounts, including those used to host virtual machine extension packages. The change blocked public read access to those accounts, which disrupted scenarios such as virtual machine extension package downloads, according to Microsoft’s status history.
In the incident, logged under tracking ID FNJ8-VQZ, some customers experienced failures when deploying or scaling virtual machines, including errors during provisioning and lifecycle operations. Other services were affected as well.
Azure Kubernetes Service users experienced failures in node provisioning and extension installation, while Azure DevOps and GitHub Actions users faced pipeline failures when tasks required virtual machine extensions or related packages. Operations that required downloading extension packages from Microsoft-managed storage accounts also saw degraded performance.
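To illustrate the failure mode, here is a minimal sketch, not Microsoft’s agent code, of how a blocked public read surfaces to a client that expects anonymous access to an extension package. The storage URL is a hypothetical placeholder, and the example assumes the Python requests library.

```python
# Minimal sketch: probe whether a VM extension package hosted in a
# storage account is reachable via anonymous (public) read access.
# The URL is a hypothetical placeholder, not a real package location.
import requests

PACKAGE_URL = "https://examplestorageaccount.blob.core.windows.net/extensions/example-extension.zip"

def check_extension_package(url: str, timeout: float = 10.0) -> bool:
    """Return True if the package can be fetched anonymously."""
    try:
        resp = requests.head(url, timeout=timeout)
    except requests.RequestException as exc:
        print(f"Network error reaching package host: {exc}")
        return False
    if resp.status_code == 200:
        return True
    # A policy that disables public read access on the storage account
    # typically surfaces to anonymous callers as 403/404-style failures,
    # which is how dependent provisioning steps end up failing.
    print(f"Anonymous read failed with HTTP {resp.status_code}")
    return False

if __name__ == "__main__":
    check_extension_package(PACKAGE_URL)
```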
Although an initial mitigation was deployed within about two hours, it triggered a second platform issue involving Managed Identities for Azure Resources. Customers attempting to create, update, or delete Azure resources, or acquire Managed Identity tokens, began experiencing authentication failures.
Microsoft’s status history page acknowledged, under tracking ID M5B-9RZ, that following the earlier mitigation, a large spike in traffic overwhelmed the managed identities platform service in the East US and West US regions.
This impacted the creation and use of Azure resources with assigned managed identities, including Azure Synapse Analytics, Azure Databricks, Azure Stream Analytics, Azure Kubernetes Service, Microsoft Copilot Studio, Azure Chaos Studio, Azure Database for PostgreSQL Flexible Servers, Azure Container Apps, Azure Firewall, and Azure AI Video Indexer.
After multiple infrastructure scale-up attempts failed to handle the backlog and retry volumes, Microsoft ultimately removed traffic from the affected service to repair the underlying infrastructure without load.
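The retry volumes Microsoft describes are a familiar pattern: when token requests fail, clients that retry in tight loops amplify the load on a recovering service. The sketch below, assuming code running on an Azure VM with a managed identity, requests a token from the instance metadata service (IMDS) and backs off exponentially with jitter rather than retrying aggressively.

```python
# Minimal sketch: acquire a managed identity token from Azure IMDS with
# exponential backoff and jitter, so callers do not pile onto a degraded
# identity service in lockstep. Assumes it runs on an Azure VM.
import random
import time
import requests

IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"
RESOURCE = "https://management.azure.com/"  # token audience; adjust per service

def get_managed_identity_token(max_attempts: int = 5) -> str | None:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(
                IMDS_TOKEN_URL,
                params={"api-version": "2018-02-01", "resource": RESOURCE},
                headers={"Metadata": "true"},
                timeout=10,
            )
            if resp.status_code == 200:
                return resp.json()["access_token"]
            print(f"Attempt {attempt}: IMDS returned HTTP {resp.status_code}")
        except requests.RequestException as exc:
            print(f"Attempt {attempt}: request failed: {exc}")
        # Backoff with jitter spreads retries out instead of creating a
        # synchronized retry storm against a recovering service.
        time.sleep(delay + random.uniform(0, delay))
        delay = min(delay * 2, 60.0)
    return None
```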
“The outage didn’t just take websites offline, but it halted development workflows and disrupted real-world operations,” said Pareekh Jain, CEO at EIIRTrend & Pareekh Consulting.
Cloud outages on the rise
Cloud outages have become more frequent in recent years, with major providers such as AWS, Google Cloud, and IBM all experiencing high-profile disruptions. In October, AWS services were severely impacted for more than 15 hours when a DNS problem rendered the DynamoDB API unreliable.
In November, a bad configuration file in Cloudflare’s Bot Management system led to intermittent service disruptions across several online platforms. In June, an invalid automated update disrupted Google Cloud’s identity and access management (IAM) system, leaving users unable to use Google to authenticate on third-party apps.
“The evolving data center architecture is shaped by the shift to more demanding, intricate workloads driven by the new velocity and variability of AI. This rapid expansion is not only introducing complexities but also challenging existing dependencies. So any misconfiguration or mismanagement at the control layer can disrupt the environment,” said Neil Shah, co-founder and VP at Counterpoint Research.
Preparing for the next cloud incident
This is not an isolated incident. For CIOs, the event only reinforces the need to rethink resilience strategies.
In the immediate aftermath of a hyperscaler dependency failure, waiting is not a recommended strategy for CIOs; they should instead stabilize, prioritize, and communicate, Jain said. “First, stabilize by declaring a formal cloud incident with a single incident commander, quickly determining whether the issue affects control-plane operations or running workloads, and freezing all non-essential changes such as deployments and infrastructure updates.”
Jain added that the next step is to prioritize restoration by protecting customer-facing run paths, including traffic serving, payments, authentication, and support. If CI/CD is impacted, critical pipelines should shift to self-hosted or alternate runners while releases queue behind a business-approved gate. Finally, communicate and contain by issuing regular internal updates that clearly state impacted services, available workarounds, and the next update time, and by activating pre-approved customer communication templates if external impact is likely.
Shah noted that these outages are a clear warning for enterprises and CIOs to diversify their workloads across cloud service providers (CSPs) or go hybrid, and to add the necessary redundancies. To limit the impact of future outages on operations, they should also keep CI/CD pipelines lean and modular.
The scaling strategy for real-time versus non-real-time workloads, especially for crucial code or services, should also be well thought through. CIOs should have a clear understanding and operational visibility of hidden dependencies, knowing what could be impacted in such scenarios, and a robust mitigation plan.
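One lightweight way to build that visibility is to map internal workflows to the external endpoints they depend on and check reachability continuously. The sketch below uses made-up workflow names and a hand-written dependency map purely for illustration; in practice this data would come from a service catalog or CMDB.

```python
# Minimal sketch of a dependency-visibility check: map internal workflows
# to the cloud endpoints they rely on, then report which workflows a
# given endpoint failure would put at risk. Names and mapping are illustrative.
import requests

DEPENDENCIES = {
    "ci-cd-pipelines": ["https://api.github.com", "https://dev.azure.com"],
    "customer-portal": ["https://login.microsoftonline.com"],
    "data-platform": ["https://management.azure.com"],
}

def probe(url: str, timeout: float = 5.0) -> bool:
    """Treat any response below HTTP 500 as 'reachable'; auth errors still count."""
    try:
        return requests.get(url, timeout=timeout).status_code < 500
    except requests.RequestException:
        return False

def report_impact() -> None:
    for workflow, endpoints in DEPENDENCIES.items():
        down = [u for u in endpoints if not probe(u)]
        status = "at risk" if down else "ok"
        suffix = f" (unreachable: {', '.join(down)})" if down else ""
        print(f"{workflow}: {status}{suffix}")

if __name__ == "__main__":
    report_impact()
```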
Original Link: https://www.infoworld.com/article/4127149/azure-outage-disrupts-vms-and-identity-services-for-over-10-hours-2.html
Originally Posted: Wed, 04 Feb 2026 11:36:51 +0000