Design overview
What it is
recourse is a policy-driven resilience library for Go services. Call sites supply a stable key, policies define the retry envelope, and the executor records structured timelines and observer events so behavior is explainable in production.
Who it is for
- Teams running multiple services that need consistent retry behavior.
- Platform or reliability engineers who want centralized policy control.
- Services that need incident-friendly, per-attempt observability.
- Teams willing to enforce low-cardinality keys and policy discipline.
Who it is not for
- One-off scripts or small apps where manual retries are enough.
- Workloads that cannot tolerate duplicate attempts without safeguards.
- Teams that do not want to manage policy governance or key conventions.
- Situations where retries are simply the wrong tool.
Mental model
A key chooses a policy, and the policy defines the envelope.
In more detail:
- Call sites provide a low-cardinality policy key such as "payments.Charge".
- A policy provider resolves the effective policy for that key.
- The executor runs attempts using classification, backoff, budgets, hedging, and circuit breaking as configured.
- The executor emits observability artifacts: timeline records and observer callbacks.
The operational contract
- Attempts, backoff, and timeouts are bounded by policy.
- Classifiers decide whether an outcome should retry, stop, or abort.
- Budgets provide backpressure so retries do not amplify outages.
- Behavior is explainable through structured timelines and observer events.
- Context cancellation is respected across attempts, sleeps, and hedges.
Tradeoffs and organizational cost
- Keys must be stable and low-cardinality, which requires governance and review.
- Policies need ownership and careful rollout to avoid surprises.
- Backpressure and hedging can change load patterns and must be introduced intentionally.
- Remote configuration adds a control plane dependency that must be operated reliably.
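Part of the key governance cost can be automated with a lint-style check. The sketch below assumes a `service.Operation` convention inferred from the "payments.Charge" example; the pattern and the `ValidKey` helper are hypothetical and are not something recourse is known to ship.

```go
package main

import (
	"fmt"
	"regexp"
)

// keyPattern encodes an assumed convention: a lowercase service name, a dot,
// and a capitalized operation name. Path separators, spaces, and hyphens,
// which usually signal embedded IDs or URLs, are rejected.
var keyPattern = regexp.MustCompile(`^[a-z][a-z0-9]*\.[A-Z][A-Za-z0-9]*$`)

// ValidKey reports whether key follows the assumed convention.
func ValidKey(key string) bool {
	return keyPattern.MatchString(key)
}

func main() {
	keys := []string{
		"payments.Charge",           // stable, one key per operation
		"payments.Charge/user-8841", // embeds a per-user ID
		"GET /orders/1234",          // raw request line with an order ID
	}
	for _, k := range keys {
		fmt.Printf("%q valid=%t\n", k, ValidKey(k))
	}
}
```

A shape check like this catches the most common mistakes, such as interpolated IDs and raw URLs, but it cannot prove a key is low-cardinality. A key such as "payments.Charge8841" would still pass, so review remains necessary.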
How to start
- Pick a key convention and enforce low-cardinality rules. See "Policy keys" and "Gotchas and safety checklist".
- Integrate via the facade in one service. See "Getting started" and "Adoption guide".
- Capture timelines for critical calls to validate behavior. See "Observability" and "Incident debugging".