If you followed the Claude Code terraform incident last week - Claude Code ran terraform destroy on production, took down 2.5 years of course submissions - you probably read Alexey's postmortem and the 500+ comment HN thread about it.
What struck me reading the postmortem wasn't the destruction itself. It was the decision chain: no remote state backend, deletion protection disabled, a Terraform archive from the old machine sitting there with full production state. Claude actually flagged the risk at multiple points. The human approved the destroy anyway.
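For reference, the two safeguards the postmortem says were missing look roughly like this in standard Terraform (bucket and resource names here are illustrative, not from the actual incident):

```hcl
terraform {
  backend "s3" {
    bucket = "example-tf-state"        # remote state, instead of an archive on one machine
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_db_instance" "prod" {
  # ...
  deletion_protection = true           # blocks deletion at the AWS API level

  lifecycle {
    prevent_destroy = true             # makes `terraform destroy` error out on this resource
  }
}
```

With either safeguard in place, the destroy would have failed loudly instead of succeeding silently.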
I built a playable version of that session. You sit in a split-panel Claude Code interface - terminal on one side, AI agent on the other - and work through the recovery. The scenario uses the same kind of setup that caused the original disaster. It takes about 10-15 minutes.
This is part of YouBrokeProd, a browser-based incident response trainer I've been building. 10 scenarios total built from real postmortems - connection pool exhaustion, Kubernetes crashloops, DNS failures, SSL expiry, and others. Three are free including this one.
Stack: Next.js, Turso (SQLite at the edge), Supabase Auth. Each scenario is a state machine - you run commands, get realistic output back, form a hypothesis, and submit a diagnosis and fix. Scored on speed, accuracy, and efficiency.
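The state-machine framing can be sketched roughly like this (a minimal illustration; the type and function names are hypothetical, not the real YouBrokeProd code):

```typescript
// Hypothetical sketch of a scenario as a state machine, not the actual implementation.
type Phase = "briefing" | "investigating" | "diagnosed" | "resolved";

interface Scenario {
  phase: Phase;
  commandsRun: number; // feeds the efficiency score
  // Canned outputs keyed by command, so the "terminal" feels real.
  outputs: Record<string, string>;
}

// Running a command moves the player into investigation and returns canned output.
function runCommand(s: Scenario, cmd: string): [Scenario, string] {
  const out =
    s.outputs[cmd] ?? `bash: ${cmd.split(" ")[0]}: command not found`;
  return [{ ...s, phase: "investigating", commandsRun: s.commandsRun + 1 }, out];
}

// Submitting a diagnosis transitions the state only if it matches the expected answer.
function submitDiagnosis(
  s: Scenario,
  diagnosis: string,
  expected: string
): Scenario {
  const correct = diagnosis.trim().toLowerCase() === expected.toLowerCase();
  return correct ? { ...s, phase: "diagnosed" } : s;
}
```

Scoring on speed and efficiency then falls out naturally from timestamps and `commandsRun`.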
The hardest part has been writing log output that's realistic enough to teach something, yet constrained enough that the scenario is actually solvable in 15 minutes. Curious what the SRE folks here think of that tradeoff.
I mean, it's cute, but I've seen humans do this in production with Terraform without any AI tools.
You can stick an AWS architect badge on someone's forehead, or label them DevOps, but if there's a lack of systems experience and unpredictable tools at hand, disaster can and will happen.
The scenario design choice that matters most here is whether the game reveals the permission model before or after the agent makes the destructive call - learning happens differently when you're diagnosing why something bad already happened versus catching the mistake in real time. Postmortem replay (your approach) builds different intuition than pre-incident "spot the misconfiguration". Both are valuable, and complementary.
Nice, this is like SadServers with a twist, excellent :-)
Thanks! SadServers is great - love what Fernando built there. The main twist here is that the scenarios are built from real postmortems rather than generic server puzzles. The terraform one is modeled directly on the Claude Code incident from last week.
Lots more to come
I like the idea and wanted to play it, but after I began the incident nothing happened - it was stuck on "waiting for incident to start" / "start incident".
Thanks for trying it out! Just pushed a fix - there was a bug where the game engine wasn't starting properly after clicking GO. Should work now. Create a free account and give it another shot, would love to hear how you do.
Interesting - love the concept, and super relevant.
Thanks - let me know if you try a scenario.