This typo sparked a Microsoft Azure outage

Microsoft Azure DevOps, a suite of application lifecycle services, stopped working in the South Brazil region for about ten hours on Wednesday due to a basic code error.

On Friday Eric Mattingly, principal software engineering manager, offered an apology for the disruption and revealed the cause of the outage: a simple typo that deleted seventeen production databases.

Mattingly explained that Azure DevOps engineers occasionally take snapshots of production databases to look into reported problems or test performance improvements. And they rely on a background system that runs daily and deletes old snapshots after a set period of time.

During a recent sprint – a group project in Agile jargon – Azure DevOps engineers performed a code upgrade, replacing deprecated Microsoft.Azure.Management.* packages with supported Azure.ResourceManager.* NuGet packages.

The result was a large pull request of changes that swapped API calls in the old packages for those in the newer packages. The typo occurred in the pull request – a code change that has to be reviewed and merged into the applicable project. And it led the background snapshot deletion job to delete the entire server.

“Hidden within this pull request was a typo bug in the snapshot deletion job which swapped out a call to delete the Azure SQL Database to one that deletes the Azure SQL Server that hosts the database,” said Mattingly.
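To make the difference between the two calls concrete, here is a minimal sketch using a toy in-memory model. It is not the Azure SDK and not Microsoft's code: the `SqlServer` class, its methods, and both cleanup functions are hypothetical names invented for illustration.

```python
class SqlServer:
    """Toy stand-in for a server that hosts several databases (hypothetical)."""

    def __init__(self, name, databases):
        self.name = name
        self.databases = set(databases)
        self.deleted = False

    def delete_database(self, db_name):
        # Removes a single database and leaves the rest untouched.
        self.databases.discard(db_name)

    def delete(self):
        # Removes the server itself, and with it every database it hosts.
        self.databases.clear()
        self.deleted = True


def cleanup_intended(server, snapshot_name):
    # What the job was meant to do: drop only the expired snapshot.
    server.delete_database(snapshot_name)


def cleanup_buggy(server, snapshot_name):
    # The mix-up: the delete is aimed at the host, not the snapshot.
    server.delete()
```

In this model the two functions take the same arguments and differ by a single call, which is the sort of change that is easy to miss in a large pull request.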

Azure DevOps has tests to catch such issues, but according to Mattingly, the errant code only runs under certain conditions and thus isn’t well covered under existing tests. Those conditions, presumably, require the presence of a database snapshot that is old enough to be caught by the deletion script.

Mattingly said Sprint 222 was deployed internally (Ring 0) without incident due to the absence of any snapshot databases. Several days later, the software changes were deployed to the customer environment (Ring 1) for the South Brazil scale unit (a cluster of servers for a specific role). That environment had a snapshot database old enough to trigger the bug, which led the background job to delete the “entire Azure SQL Server and all seventeen production databases” for the scale unit.
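The gap between Ring 0 and Ring 1 can be sketched in a few lines. This is an illustration only: `run_cleanup`, `faulty_delete`, the seven-day retention window, and the snapshot name are all assumptions, not details from Microsoft's report.

```python
def run_cleanup(snapshot_ages_days, retention_days, delete):
    """Call delete(name) for each snapshot older than the retention window.

    Returns the number of deletions attempted.
    """
    attempted = 0
    for name, age_days in snapshot_ages_days.items():
        if age_days > retention_days:
            delete(name)
            attempted += 1
    return attempted


class WrongTargetError(Exception):
    """Raised by the deliberately faulty delete function below."""


def faulty_delete(name):
    # Stands in for a delete call aimed at the wrong resource.
    raise WrongTargetError(f"would have removed more than {name}")


# An environment with no snapshots: the faulty call is never reached.
ring0_result = run_cleanup({}, retention_days=7, delete=faulty_delete)

# An environment with one old snapshot: the faulty call finally runs.
try:
    run_cleanup({"snapshot-a": 30}, retention_days=7, delete=faulty_delete)
    ring1_failed = False
except WrongTargetError:
    ring1_failed = True
```

The first run finishes cleanly because the loop body never executes, so a clean internal deployment says nothing about whether the delete call is correct.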

The data has all been recovered, but it took more than ten hours. There are several reasons for that, said Mattingly.

One is that since customers can’t revive Azure SQL Servers themselves, on-call Azure engineers had to handle that, a process that took about an hour for many.

Another reason is that the databases had different backup configurations: some were configured for Zone-redundant backup and others were set up for the more recent Geo-zone-redundant backup. Reconciling this mismatch added many hours to the recovery process.

“Finally,” said Mattingly, “even after databases began coming back online, the entire scale unit remained inaccessible even to customers whose data was in those databases due to a complex set of issues with our web servers.”

These issues arose from a server warmup task that iterated through the list of available databases with a test call. Databases in the process of being recovered chucked up an error that led the warmup test “to perform an exponential backoff retry resulting in warmup taking ninety minutes on average, versus sub-second in a normal situation.”
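A rough sketch of the arithmetic shows why exponential backoff can stretch a sub-second check into a wait of an hour or more. The base delay, multiplier, and retry count below are assumptions chosen for illustration; the report does not give the real settings.

```python
def backoff_delays(base_seconds, factor, retries):
    """Delay before each retry: base, base*factor, base*factor**2, ..."""
    return [base_seconds * factor ** attempt for attempt in range(retries)]


def total_wait(base_seconds, factor, retries):
    """Total time spent waiting if every retry fails."""
    return sum(backoff_delays(base_seconds, factor, retries))
```

With an assumed one-second base delay that doubles each time, twelve failed attempts add up to 4,095 seconds of waiting, a little over 68 minutes, because each retry waits as long as all the previous ones combined plus the base delay.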

Further complicating matters, the recovery was staggered: once one or two of the servers started taking customer traffic again, they would get overloaded and go down. Ultimately, restoring service required blocking all traffic to the South Brazil scale unit until everything was sufficiently ready to rejoin the load balancer and handle traffic.

Various fixes and reconfigurations have been put in place to prevent the issue from recurring.

“Once again, we apologize to all the customers impacted by this outage,” said Mattingly. ®
