SRE WEEKLY – scalability, availability, incident response, automation

SRE Weekly Issue #512

lex

April 12, 2026

Ashby taught us we have to fight fire with fire

Improving robustness requires increasing complexity. Let’s throw more complexity at it?

I’m using this enormously complex system, an LLM, to help me solve a problem that was created by software complexity in the first place.

Lorin Hochstein

Multi-Agent System Reliability

This feels like using multiple agents as a sort of redundancy and cross-validation architecture to improve the reliability of agent output..

Alex Ewerlöf

Why End-to-End Testing Fails in Microservice Architectures

This article explains why end-to-end testing breaks down in microservice-based systems, not due to poor tooling, but because of fundamental architectural and operational mismatches.

Alok Kumar — DZone

AI-generated code ships fast, but runtime control hasn’t kept up

LaunchDarkly’s survey data show have some interesting things to say about the impact of AI.

[…] while build and deployment velocity have improved, production reliability has not.

LaunchDarkly

The Picture They Paint of You

Fred Hebert surveyed how AI coding assistants vs. AI SRE tools are marketed and found a stark divide: coding assistants are framed as partners that augment engineers, while AI SREs are framed as replacements for low-value work. The implication is that the people building and buying these tools see incident response as grunt work to be automated away — and that says a lot about how decision-makers perceive the role.

Fred Hebert

5 Incident Response Principles for CTOs

I especially like the point that incidents are leadership moments — how you respond tells your team everything about the culture you’re building. This one is aimed at CTOs, but really it’s a great reminder for anyone in a leadership role during incidents.

Joe Mckevitt — Uptime Labs

Why Retries Are More Dangerous Than Failures

There’s a really interesting bit in this one about libraries and layers of the system doing their own retries without your knowledge, magnifying retry volume.

David Iyanu Jonathan — DZone

The post-mortem problem

I like the section on what AI should and shouldn’t do. It’s important to avoid automating away the process of learning from incidents.

incident.io

SRE Weekly Issue #511

lex

April 5, 2026

General

Comments

View on sreweekly.com

What Makes Incidents Different to Routine Work?

This one’s definitely going to be good to keep in mind during my next incident.

FYI for folks with no or low vision, there’s a screenshot of J. Paul Reed quoting Vanessa Huerta Granda: “Incidents are where engineers are made”.

Stuart Rimell — Uptime Labs

Migrating Etsy’s database sharding to Vitess

Etsy migrated a 1,000-table DB with 1,000 shards (with their own custom ORM!) over to vitess, and it took some care, especially in how they handled transactions.

Ella Yarmo-Gray — Etsy

We Automated Everything Except Knowing What’s Going On

Wow, this one sure hits hard.

Kenneth Eversole

Why our Kafka consumers survived the day but died every night

The section on lessons learned toward the end of this debugging story is a goldmine.

Lokesh Soni

Reliability Engineering for Air-Gapped Systems

How do you ensure reliability in a system you can’t access? How can you monitor SLIs/SLOs without metrics?

Alex Ewerlöf

How I Dragged Phantom Tide Out of an OOM Kill Loop

I love a good debugging story, and this one delivers, with a confluence of gnarly problems and lessons we can all learn from.

James Sawyer — Phantom Tide

Cloudflare outage on February 20, 2026

Oof, what a nasty little gotcha in the API call at the heart of this incident.

David Tuber and Dzevad Trumic — Cloudflare

Quick takes on Feb 20 Cloudflare outage

Lorin’s Law strikes again!

System intended to improve reliability contributed to incident

Lorin Hochstein

SRE Weekly Issue #510

lex

March 29, 2026

General

Comments

View on sreweekly.com

Building SRE Error Budgets for AI/ML Workloads: A Practical Framework

ML systems decay gradually instead of breaking suddenly, so we need error budgets for model accuracy, data freshness, and fairness — not just uptime.

Varun Kumar Reddy Gajjala — DZone

Why Enterprises Overfund Failure and Underfund Prevention

Enterprises rarely fail because they don’t care about reliability.
They fail because:

failure is loud,

prevention is quiet,

and budgeting systems are wired to respond to noise.

Florian Hoeppner

Automating RDS Postgres to Aurora Postgres Migration

They had hundreds of databases to migrate, so they built a tested, self-service migration workflow.

Ram Srivasta Kannan, Wale Akintayo, Jay Bharadwaj, John Crimmins, Shengwei Wang, and Zhitao Zhu — Netflix

Shedding old code with ecdysis: graceful restarts for Rust services at Cloudflare

I love the technical description of socket juggling to achieve a graceful restart. I could swear that this technique has been around for decades though, for example in TinyMUX et al…

Manuel Olguín Muñoz — Cloudflare

Lots of AI SRE, no AI incident management

Lorin goes into what an AI incident manager might look like, since no tools of the sort exist yet.

Lorin Hochstein

When Kubernetes Forgets: The 90-Second Evidence Gap

By default, Kubernetes keeps a pretty short event history. This article argues that what we really need is the ability to know the state of the system at a specific time.

Shamsher Khan — DZone

Safeguarding dynamic configuration changes at scale

They built a platform for safely rolling out configuration changes. I like that it has a special mode for use in incident response.

Cosmo W. Q — Airbnb

Catching a caching bug at Readyset

This is a cool debugging story, and I love the emphasis on mental models. The bit about simulating different paths through the software is quite intriguing.

Michael Victor Zink — Readyset (via Antithesis)

SRE Weekly Issue #509

lex

March 22, 2026

General

Comments

View on sreweekly.com

SRE Weekly is back! My partner is doing well, and thanks for all the kind words and well-wishes.

Practical Considerations for AI Incident Reviews

There’s a lot you miss out on if you get an LLM to write your incident review.

incident reviews are fundamentally a socio-technical process, and they do not provide benefit if people don’t engage with them.

Fischer

The Hidden Cost of “We’ll Fix It Later”

I love this concept of reliability debt.

Spiros Economakis

What I’m Learning from Aviation About Incident Preparedness

This one starts with an insightful comparison of two commercial aviation incidents and the crew’s actions. It goes on to draw broader lessons that we can use as SREs.

Hamed Silatani — Uptime Labs

AI-Generated SQL Is Breaking Database Governance

What happens now that SQL is being written by LLMs? I love the analogy to the advent of ORMs that abstracted away the generation of SQL.

Tanmay Sinha — Readyset

Are bugs and incidents inevitable with AI coding agents? – Stack Overflow

What specific kind of bugs is AI more likely to generate? Do some categories of bugs show up more often? How severe are they? How is this impacting production environments?

They did a survey of 470 codebases and share the numbers on the rate of bugs generated by LLMs versus humans.

David Loker — CodeRabbit

10 Real-World Status Page Examples: And What You Can Learn From Them

This post looks at ten real status page examples from teams that have dealt with outages at scale. Each example highlights what they communicate well, where they set expectations clearly, and how small details reduce confusion during incidents.

Laura Clayton — UptimeRobot

Disappointing People Early

If you don’t explicitly state your expected level of reliability, your customers will infer one and hold you to it anyway. “Disappoint” them early by telling them what to expect.

Dave O’Connor

On variability

Humans exhibit variation in how we respond to a given situation, and this article argues that it’s one of our strengths. LLMs intentionally also exhibit variability.

Lorin Hochstein

SRE Weekly Issue #508

lex

February 1, 2026

General

Comments

View on sreweekly.com

SRE Weekly will be going on hiatus for 6 weeks, while I’m on leave caring for my partner after her kidney transplant surgery this week. It’s incredible that the National Kidney Registry’s Paired Exchange program allowed me to donate a kidney to help her even though we don’t have matching blood types!

Software engineering when machine writes the code

What do we miss when we have LLMs write our code for us? This article explains that one thing we can miss out on is building a mental model.

Shayon Mukherjee

Decompensation and Cascading Failures

I really love this explanation of the concept of compensation.

Compensation is a very interesting mechanism in software systems because it can keep complex systems alive, but also because it can be a factor in how they quickly and unexpectedly collapse.

Fred Hebert — Resilience in Software Foundation

Telling the wrong story

When you investigate an incident and tell the story about what you found, but no one believes you because there’s no smoking gun or bad actor…

Lorin Hochstein

Why Reliability Starts with Responsibility

To build and maintain reliable systems, organizations must align responsibility with control. This is where the Ownership Trio—Mandate, Knowledge, and Accountability—comes in.

Spiros Economakis

Futureproofing Tines: Partitioning a 17TB table in PostgreSQL

I love when an article goes through the designs they passed over (and why) before reaching their final design, as in this one.

Julianne Walker — Tines

Docker lazy loading at Grab: Accelerating container startup times

If you’re unfamiliar with Docker image lazy loading like I was, this is a great primer on two options, Estargz and SOCI.

Huong Vuong and Joseph Sahayaraj — Grab

Chasing Boring at Just the Right Speed

But don’t let MTTR become the thing you’re optimising for. The goal is to build systems and processes where you’re constantly learning and improving, not systems where you’re just really efficient at fighting the same fires over and over.

Dave O’Connor

I watched a supposedly “resilient” Multi-Region setup completely implode recently. The architecture diagram looked great – active workloads in US-East, cold standby in US-West. But when the provider had a global IAM service degradation, the whole thing became a brick.

u/NTCTech on Reddit

SRE Weekly Issue #512

SRE Weekly Issue #511

SRE Weekly Issue #510

SRE Weekly Issue #509

SRE Weekly Issue #508

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, Archera:

A message from our sponsor, Depot:

A message from our sponsor, Clickhouse:

A message from our sponsor, Costory :

A message from our sponsor, Costory:

Subscribe

RSS

Mastodon

Search Issues