22 Apr 2025 2 min read devops

Designing for Failure: How to Build Systems That Don’t Panic

Most systems are built with the happy path in mind. They work beautifully in dev, pass all the tests, and even make it to production smoothly.

Until they don’t.

This post is about designing systems that expect failure. Systems that degrade gracefully. Systems that don’t panic.

The Real World Is a Mess

Every service you depend on can and will go down. API calls will time out. Connections will reset. Someone will deploy a bad config and walk away. If your system treats these as rare surprises instead of expected scenarios, you’re one incident away from disaster.

Designing for failure means baking in tolerance and resilience. It’s not just about high availability—it’s about survivability.

Principles of Fault-Tolerant Design

Here are the patterns to build systems that bend, not break:

1. Fail Open vs Fail Closed

Decide what the right kind of failure is.

A payment processor should fail closed: better to block a charge than to let one through unverified or twice. A feature flag service? That should probably fail open with cached defaults so your entire app doesn’t crash because of a missing toggle.

Rule of thumb:
Fail open for non-critical paths. Fail closed for high-integrity boundaries.
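
To make the rule of thumb concrete, here is a minimal sketch of both behaviours side by side. Everything in it is hypothetical: the `CACHED_FLAGS` dict, the `fetch` and `check` callables standing in for real service clients, and the choice to default unknown flags to off.

```python
from typing import Callable, Dict

# Last known-good flag values. A real service would refresh this cache on
# every successful fetch; the name and contents here are hypothetical.
CACHED_FLAGS: Dict[str, bool] = {"new_checkout": False}


def flag_enabled(name: str, fetch: Callable[[str], bool]) -> bool:
    """Fail open: if the flag service is unreachable, fall back to the cache."""
    try:
        value = fetch(name)
        CACHED_FLAGS[name] = value  # remember the last good answer
        return value
    except Exception:
        # Non-critical path: keep serving with the cached (or default-off) value.
        return CACHED_FLAGS.get(name, False)


def authorize_charge(amount_cents: int, check: Callable[[int], bool]) -> bool:
    """Fail closed: if the check cannot complete, refuse the charge."""
    try:
        return check(amount_cents)
    except Exception:
        # High-integrity boundary: an unverifiable charge is a rejected charge.
        return False
```

The two functions catch the same exception and make opposite decisions, which is the whole point: the failure mode is a design choice, not an accident.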

2. Set Timeouts and Retries (but not blindly)

Blind retries are dangerous: they pile extra load onto a service that is already struggling. You need:

Timeouts: Never trust a network call without one.
Backoff: Exponential with jitter, not fixed intervals.
Limits: Cap retry attempts. Infinite retries turn minor outages into thundering herds.

Pro tip: Retries without idempotency guarantees = duplicate side effects. Don’t do it.
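
The three ingredients above fit in a few lines. This is a sketch, not a drop-in library: `call_with_retries` and `backoff_delay` are hypothetical names, the defaults are arbitrary, and `op` is assumed to be idempotent, to enforce its own timeout, and to raise `TimeoutError` or `ConnectionError` on failure.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(
    op: Callable[[], T],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op, retrying on timeout or connection errors, up to max_attempts tries."""
    for attempt in range(max_attempts):
        try:
            return op()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # retry budget spent: surface the failure
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

The jitter matters as much as the exponent. Without it, every client that failed at the same moment retries at the same moment too. Passing `sleep` as a parameter is only there so the function can be tested without waiting.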

3. Circuit Breakers and Bulkheads

Don’t let one failure take down everything.

Circuit breakers trip when downstream errors exceed a threshold. They stop calls temporarily and let systems recover.
Bulkheads isolate subsystems (like ship compartments). A failure in one doesn’t flood the rest.

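Use libraries like Resilience4j (Java) or Polly (.NET), or build equivalents into your Node.js services. Netflix’s Hystrix popularized the pattern, but it is now in maintenance mode and its maintainers point new projects to Resilience4j.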
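
If you do build your own, the core of a circuit breaker is small. The sketch below is a simplified, single-threaded illustration with hypothetical names and defaults: it counts consecutive failures rather than an error rate, and after the cool-off it lets the next call through as a trial.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-off period in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, op: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("downstream is cooling off")
            # Cool-off elapsed: half-open, so this call goes through as a trial.
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each downstream call in `breaker.call(...)` and treat `CircuitOpenError` as a signal to use a fallback. For a bulkhead, give each dependency its own breaker and its own bounded worker pool, so one slow dependency cannot exhaust shared capacity.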

4. Graceful Degradation

Not every feature needs to work 100% of the time. If an upstream ML service is down, don’t block the entire user request—serve a static fallback or partial result.

Great UX comes from thoughtful degradation paths.
Blank screens and spinners are a failure of imagination.
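
A degradation path can be as simple as a `try`/`except` around the optional part of the response. In this sketch the product page, the `recommend` callable, and the fallback list are all made up for illustration.

```python
from typing import Callable, Dict, List

# Static fallback, e.g. a hand-picked bestseller list. Hypothetical data.
FALLBACK_RECOMMENDATIONS: List[str] = ["bestseller-1", "bestseller-2"]


def build_product_page(product_id: str,
                       recommend: Callable[[str], List[str]]) -> Dict[str, object]:
    """Return the core page even when the recommendation call fails."""
    page: Dict[str, object] = {"product_id": product_id, "degraded": False}
    try:
        page["recommendations"] = recommend(product_id)
    except Exception:
        page["recommendations"] = FALLBACK_RECOMMENDATIONS
        page["degraded"] = True  # lets the UI and metrics see the fallback was used
    return page
```

The `degraded` flag is worth keeping: it lets the front end adjust its copy and lets you alert when fallbacks are being served more often than expected.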

5. Observability is Not Optional

You can’t fix what you can’t see. That means:

Structured logs with trace IDs
Health and readiness probes
Real-time alerts on failures (and absence of success)
Dashboards that expose intent (are users completing what they came to do?), not just uptime

When something breaks, your system should be the first to know—not your customer.
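
As one example from that list, structured logs with trace IDs need nothing beyond the standard library. This is a minimal sketch: the `checkout` logger name, the JSON field names, and the use of a random UUID as the trace ID are assumptions, and a real service would usually propagate the ID from an incoming request header instead of minting a new one.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Set via the `extra` argument on the logging call; None if absent.
            "trace_id": getattr(record, "trace_id", None),
        })


def new_trace_id() -> str:
    """Generate a 32-character hex ID to tie together one request's log lines."""
    return uuid.uuid4().hex


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

trace_id = new_trace_id()
logger.info("payment authorized", extra={"trace_id": trace_id})
```

Once every line carries the same `trace_id`, you can follow a single request across services with one search instead of guessing from timestamps.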

Quick Resilience Checklist

Area	Pattern to Apply	Notes
External APIs	Timeouts + Retries + Circuit Breakers	Tune timeouts and retries per call, not with one global default
Message Processing	Idempotency + Dead-letter Queues	Track processed message IDs so redeliveries are skipped
Database Writes	Transactions + Retry Logic	Watch out for race conditions
Feature Flags	Fail Open + Fallback Values	Cache last good state locally
Service Startup	Readiness Probes + Retry Configs	Don’t let pods go live before ready
User-Facing Features	Graceful Degradation Paths	Never break the core experience
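
The message-processing row is the one that most often gets hand-waved, so here is a rough sketch of it. The `Consumer` class and its in-memory `seen_ids` set and `dead_letters` list are hypothetical stand-ins; a real system would keep processed IDs in a durable store and publish failures to an actual dead-letter queue.

```python
from typing import Callable, Dict, List, Set

Message = Dict[str, str]


class Consumer:
    def __init__(self, handler: Callable[[Message], None], max_attempts: int = 3) -> None:
        self.handler = handler
        self.max_attempts = max_attempts
        self.seen_ids: Set[str] = set()        # IDs already handled successfully
        self.dead_letters: List[Message] = []  # messages that exhausted their attempts

    def process(self, msg: Message) -> bool:
        """Return True if the handler ran successfully for this delivery."""
        if msg["id"] in self.seen_ids:
            return False  # duplicate delivery: skip, the side effect already happened
        for _ in range(self.max_attempts):
            try:
                self.handler(msg)
            except Exception:
                continue  # try again until the attempt budget is spent
            self.seen_ids.add(msg["id"])
            return True
        # Park the poison message instead of blocking the queue behind it.
        self.dead_letters.append(msg)
        return False
```

Recording the ID only after the handler succeeds means a crash mid-handler leads to a retry, not a silently dropped message. The handler itself still needs to tolerate running twice.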

Takeaway:
You can’t stop failure—but you can make it survivable. Build your systems like the world is on fire. Because someday, it will be.