Building for Production: Failure, Recovery, and Real-World Resilience

Why software that demos well often fails in production, and how to build systems that survive the chaos of the real world.

Brandon Lanthrip

January 8, 2025

This is part 2 of a 5-part series on building evolvable architecture. ← Previous: The Architecture Manifesto

Production is where good intentions meet harsh reality. This is where your carefully crafted demo system faces real users, real data, and real problems. Many systems that shine in controlled environments crumble under the unpredictable stress of the real world.

Principle 1: Design for Production, Not Just Delivery

A lot of modern software design resembles the American automotive industry from the 1970s through the early 2000s. The Chevrolet Vega was a car that could demo very well. It won the “1971 car of the year” from Motor Trend magazine, and even an award from the American Iron and Steel Institute for “excellence in design and transportation equipment.”

First impressions are often wrong, though. The Vega ultimately became widely known for problems with its reliability, its safety, and even its exceptional ability to rust through its own panels. After six short years on the market, it had tarnished the reputation of General Motors.

We don’t want to build a Chevrolet Vega. We want to build a real machine built for the road: systems that welcome abuse, survive wear, and stay running long after the shine wears off. Flashy software that only thrives in lab conditions is a liability.

The Real Cost of Fragility

Instability carries real economic consequences. It doesn’t just cost developer time—it bleeds revenue, erodes reputation, and compounds operational drag. The cost of fragility is rarely visible on day one, but it always collects its rent.

Although no one sets out to build a brittle system, these common patterns all but guarantee one:

Demo Driven Design

Demo-driven systems are impressive in sterile environments with no users and no real data, but disastrous in production. The ultimate goal is to reliably solve real problems at scale. Your definition of “done” should reflect that.

Checkbox Engineering

Moving tickets does not mean moving a product forward. A system that elegantly meets the spec and has all the sprint tickets in the “done” column can still be a nightmare to operate. Delivery velocity is meaningless if it produces fragile systems.

Going Bankrupt on Technical Debt

Shipping fast by cutting corners leads to long-term drag. Fragile features generate support debt that slows everything else down. Most systems live far longer in production than in development, so accepting higher operating costs in exchange for short-term delivery speed is a bad bet.

Integration Fragility

Good software is cynical. It expects other systems to fail and protects itself when they do. Tightly coupled microservices often have all the complexity of distributed systems, with all the entanglements of monoliths. When highly interactive complexity meets tight coupling, even small faults can trigger cascading failure.
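One way to put that cynicism into code is to give every call to another system a deadline and a degraded answer. The sketch below is only an illustration under stated assumptions: `call_with_fallback` and `DEFAULT_RECOMMENDATIONS` are hypothetical names, and the dependency is any zero-argument callable.

```python
import concurrent.futures
import logging

logger = logging.getLogger("integration")

# Hypothetical degraded answer to serve when the dependency misbehaves.
DEFAULT_RECOMMENDATIONS = ["bestsellers"]

# Shared worker pool so a slow dependency never blocks the calling thread.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_fallback(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback if it is slow or raises."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Too slow: degrade instead of hanging, and say so in the logs.
        logger.warning("dependency exceeded %.2fs, serving fallback", timeout_s)
        return fallback
    except Exception as exc:
        # Failed outright: degrade instead of crashing, and say so in the logs.
        logger.warning("dependency failed (%s), serving fallback", exc)
        return fallback
```

Note that the worker thread keeps running after a timeout, so the pool size also acts as a cap on how many slow calls can pile up. The fallback is logged rather than hidden, so the degradation stays visible to operators.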

Principle 2: Anticipate Failure, Then Survive It

New software often enters this world as we all do: naive, optimistic, maybe even a little cocky. But once it is forced to leave the predictable and comforting confines of the lab and face the harsh realities of the real world, it is sink or swim. Things happen in the real world that just don’t happen in the lab—usually bad things.

Nobody ever has all the right answers, and every system will make mistakes. But the ones that survive are the ones designed to absorb adversity and keep moving. Resilient systems don’t panic when failure arrives; they expect it.

Anti-patterns That Kill Systems

Happy Path Obsession

Software that only works when everything works isn’t real software; it’s wishful thinking. Systems must be exercised under real-world stress, not just pampered in idealized test cases.

Faults Cascading to Failures

Not every bug needs to lead to full failure or loss of function. Faults are inevitable, but they can be contained so that they do not cascade. Resilient systems slow, or even stop, crack propagation before critical functionality is lost.
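A small illustration of containment is a batch job that refuses to let one bad record take down the rest. This is a minimal sketch, assuming a hypothetical `process_batch` helper and a caller-supplied `handler`; real systems would also decide what to do with the failed records.

```python
import logging

logger = logging.getLogger("batch")


def process_batch(records, handler):
    """Apply handler to each record; one bad record does not abort the rest."""
    succeeded, failed = [], []
    for record in records:
        try:
            succeeded.append(handler(record))
        except Exception as exc:
            # Contain the fault at the record boundary and keep it visible.
            logger.error("record %r failed: %s", record, exc)
            failed.append((record, exc))
    return succeeded, failed
```

The fault boundary here is the single record. The failures are returned alongside the successes, so the caller can retry, park, or alert on them instead of losing the whole batch.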

Intimate Integration

A single bad integration shouldn’t be able to sink a whole ship. Build systems that isolate themselves defensively. Use circuit breakers, proper timeout enforcement and bulkheads to prevent local failure from becoming systemic collapse.
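To make the circuit breaker idea concrete, here is a minimal single-threaded sketch. The class name, thresholds, and `clock` parameter are assumptions chosen for illustration, not a reference implementation; a production breaker would also need locking and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency is cooling down")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of queuing up behind a dependency that is already struggling, which gives that dependency room to recover.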

Silent Failures

The worst failures are silent. When a fault festers in a system like an untended wound, it can cause irreparable damage to data or leave the system in a bad state. When something breaks, make noise quickly. Fail loudly and fail early in ways your operators can see and act on.
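One place to fail early is at startup, before bad settings can quietly corrupt anything. The sketch below assumes two hypothetical settings, `DATABASE_URL` and `REQUEST_TIMEOUT_S`, and a hypothetical `load_settings` function; the point is the shape, not the specific keys.

```python
class ConfigError(ValueError):
    """Raised at startup when required settings are missing or invalid."""


def load_settings(env):
    """Validate required settings up front and report every problem at once."""
    problems = []

    url = env.get("DATABASE_URL", "")
    if not url:
        problems.append("DATABASE_URL is missing")

    raw_timeout = env.get("REQUEST_TIMEOUT_S", "")
    try:
        timeout_s = float(raw_timeout)
        if not timeout_s > 0:  # also rejects NaN
            problems.append("REQUEST_TIMEOUT_S must be positive")
    except ValueError:
        problems.append(f"REQUEST_TIMEOUT_S is not a number: {raw_timeout!r}")

    if problems:
        # Refuse to start rather than limp along on guessed defaults.
        raise ConfigError("; ".join(problems))
    return {"database_url": url, "timeout_s": timeout_s}
```

Called with `os.environ` at process start, this turns a misconfiguration into an immediate, readable crash instead of a mystery that surfaces hours later.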

Unpredictable Recovery

Can the system return to a stable state without restarting the world? Can it replay failed events or recover gracefully from partial failure? Stability doesn’t mean never going down; it means recovering predictably. Survivability is a design goal.
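Replaying failed events is only safe if handling an event twice does no harm. The following is a minimal in-memory sketch of that idea, assuming events are dictionaries with a unique `id` key; `EventProcessor` and its method names are hypothetical, and a real system would keep both the processed IDs and the parked events in durable storage.

```python
class EventProcessor:
    def __init__(self, handler):
        self.handler = handler
        self.processed_ids = set()   # makes replays idempotent
        self.dead_letters = []       # failed events parked for later replay

    def process(self, event):
        """Handle one event; return True if it is now applied."""
        if event["id"] in self.processed_ids:
            return True  # already applied: replaying is a no-op
        try:
            self.handler(event)
        except Exception:
            self.dead_letters.append(event)  # park it, keep the stream moving
            return False
        self.processed_ids.add(event["id"])
        return True

    def replay_dead_letters(self):
        """Retry parked events once; those that fail again stay parked."""
        pending, self.dead_letters = self.dead_letters, []
        return sum(self.process(event) for event in pending)
```

After an outage ends, calling `replay_dead_letters()` brings the system back to a consistent state without restarting anything, and accidentally replaying an event that already succeeded changes nothing.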


Building for production means accepting that your system will be tested by forces you can’t control or predict. The question isn’t whether failure will happen—it’s whether your system will survive it.

Next up: Decoupling and Domain Boundaries →