Retries are policy, not control flow
A retry is not free.
Every extra attempt is more concurrency, more sockets, more CPU, and more pressure on the dependency that is already unhappy.
That means retry behavior is not just local control flow. It is operational policy.
The moment you add retries, you are making a promise:
- attempts are bounded
- failures are classified correctly
- backoff is deliberate
- load is gated when a dependency is struggling
- behavior is explainable during an incident
recourse exists to make that promise repeatable in Go services.
It gives call sites a boring API and moves resilience decisions into a policy-keyed envelope: retries, timeouts, budgets, hedging, circuit breaking, classifiers, and timelines.
As of v1.2.0, the core API is covered by the v1.x compatibility guarantee, and optional integrations such as gRPC and OpenTelemetry live in separate modules so the root module stays small.
Who this is for
recourse is not for every retry.
If you need one small retry loop at one or two call sites, a focused backoff helper is probably the right tool. Keep the dependency small. Keep the behavior local. There is no virtue in turning a simple call into a platform.
recourse is for the point where retry behavior starts becoming an operating concern:
- multiple services have copied slightly different retry loops
- teams disagree about which failures are retryable
- incidents are made worse by retry amplification
- nobody can tell what happened on each attempt
- retry, timeout, circuit, and budget behavior needs to be governed consistently
That is the line where retries stop being a helper function and start being policy.
The core move: the call site picks a key rather than a mechanism
Most retry libraries start with mechanism:
- how many attempts?
- which backoff?
- which errors?
- which timeout?
recourse starts from a different primitive:
A low-cardinality policy key is the unit of control.
    user, err := recourse.DoValue[User](
        ctx,
        "user-service.GetUser",
        func(ctx context.Context) (User, error) {
            return client.GetUser(ctx, userID)
        },
    )
The call site names the operation. The policy defines the envelope.
That envelope can include:
- retry limits and backoff
- per-attempt and overall timeouts
- budgets as backpressure gates
- hedging configuration
- circuit breaking configuration
Classification is handled separately by classifiers that map (value, err) into an Outcome.
That distinction matters. Call sites should not have to re-implement resilience semantics every time they call another service.
Keys must be low-cardinality or the model breaks
Once keys select policy, they also become observability dimensions. They feed caches, breakers, budgets, latency trackers, metrics, traces, and logs.
So keys must be stable and low-cardinality.
Good keys:
    "payments.Charge"
    "user-service.GetUser"
    "db.Users.Query"
Bad keys:
    "GET /users/123"
    "user-service.GetUser?user_id=123"
    "payments.Charge:tenant=acme"
A policy key is not a request label. It is the name of an operation class.
Dynamic data belongs in logs, traces, structured fields, or application-level diagnostics. It does not belong in the policy key.
ADR-001 is explicit about this because a lot of the system assumes keys are safe to aggregate on.
Policies are untrusted input, so normalize them
If policy is data, policy can be wrong.
That has to be treated as a real failure mode.
Every EffectivePolicy is normalized and clamped through EffectivePolicy.Normalize():
- unsafe or missing values are pushed into documented ranges
- normalization metadata is recorded
Policy resolution failure is also explicit. By default, recourse fails closed with FailureDeny and returns retry.NoPolicyError; callers can check it with errors.Is(err, retry.ErrNoPolicy). If a service wants different behavior, it can opt into a single-attempt allow mode or a fallback policy.
That is not ceremony. It is a guardrail against accidental busy loops, runaway attempts, and configuration mistakes becoming incidents.
Failure semantics belong in classifiers
A timeout, a 429, a 404, a connection reset, and quota exhaustion are not the same operational event.
Treating them the same is how retry code causes self-inflicted outages.
recourse uses classifiers. A classifier maps (value, err) into an Outcome, and the executor uses that outcome to decide whether to retry, stop, or abort.
Built-ins include:
- classify.AutoClassifier, which dispatches to HTTP semantics when the error implements HTTPError and otherwise falls back to retry-on-error
- classify.HTTPClassifier, which understands idempotent HTTP methods, transport errors, 5xx, 408/429, configured extra 4xx, and Retry-After
- a gRPC classifier in the gRPC integration module that interprets gRPC status codes and delegates non-gRPC errors
The important part is not that these classifiers exist. The important part is that retryability is a named semantic decision, not an accidental side effect of if err != nil.
Backpressure is part of the retry contract
Retries and hedges multiply load.
Without explicit backpressure, a small dependency problem can become a retry storm.
Budgets in recourse provide a retry and hedge gate:
- allow the attempt
- deny the attempt and record why
- optionally return a release handle for reservation-style resources
The first/base attempt is not charged to the retry budget. Budgeting starts when the executor launches a retry attempt; hedges can have a separate hedge budget.
Built-ins include:
- budget.UnlimitedBudget
- budget.TokenBucketBudget
Budget failure modes are explicit and observable:
- empty budget name is allowed with reason "no_budget"
- missing registry, missing budget, and nil budget are controlled by MissingBudgetMode
- the default for missing budget dependencies is fail-closed
If a dependency is already overloaded, retrying harder is often the wrong answer. Budgets are how recourse makes that part of the contract.
Hedging is tail-latency tooling, not just parallel retries
Average latency can look fine while p95 and p99 latency dominate user experience.
recourse supports hedging: starting another attempt while the first is still in flight, then returning the first successful result.
Supported modes include:
- fixed-delay hedging
- latency-aware hedging based on recent per-key latency stats
The behavior is explicit:
- first success wins and cancels the group context
- CancelOnFirstTerminal can stop the group on a non-retryable outcome
- hedges can use their own budget separate from normal retry attempts
Hedging is powerful and dangerous. It only belongs in systems with budgets, cancellation, and observability.
Circuit breaking belongs in the same envelope
Retries help when failures are transient.
When failures are persistent, retries are waste.
recourse includes a consecutive-failure circuit breaker with standard states:
- closed
- open
- half-open
When the circuit is open, calls fail fast with CircuitOpenError. After cooldown, the breaker allows limited half-open probes. Success closes the circuit; failure reopens it.
Because circuit breaking is integrated with the rest of the envelope, the behavior can be coordinated. For example, hedging is disabled during half-open probing so a recovering dependency is not hammered by parallel probes.
The goal is not to bolt several resilience mechanisms together. The goal is to make them cooperate.
Explainability: timelines and observer hooks
Retry behavior is invisible unless you deliberately surface it.
During an incident, “the call failed” is not enough.
You usually need to know:
- how many attempts happened
- why each attempt was retryable or terminal
- whether budget/backpressure allowed the attempt
- how long each attempt took
- what backoff was chosen
- whether the final failure was exhaustion, cancellation, denial, or a terminal classification
recourse gives you two complementary paths.
Timeline capture
For debugging, capture an observe.Timeline on demand:
    ctx, capture := observe.RecordTimeline(ctx)
    user, err := recourse.DoValue(ctx, "user-service.GetUser", op)

    tl := capture.Timeline()
    for _, a := range tl.Attempts {
        // Also available per attempt: a.IsHedge, a.BudgetAllowed, ...
        fmt.Printf("outcome=%v backoff=%v err=%v\n", a.Outcome, a.Backoff, a.Err)
    }
A representative timeline might tell this story:
    attempt=0 reason=http_503 budget_allowed=true backoff=50ms  err=upstream 503
    attempt=1 reason=http_503 budget_allowed=true backoff=100ms err=upstream 503
    attempt=2 reason=success  budget_allowed=true backoff=0s    err=<nil>
The timeline records per-attempt timings, outcomes, errors, backoff decisions, and budget gating, plus call-level attributes when present.
Streaming observers
For logs, metrics, and tracing integrations, implement observe.Observer.
The OpenTelemetry integration lives in integrations/otel as a separate module, so tracing support is available without adding OTel dependencies to the core library. It can record attempts as span events by default, or as child spans when you want per-attempt trace structure.
Observers receive lifecycle events, attempt events, hedge spawn events, and budget decision events.
The point is simple: recourse does not just retry. It tells you what it did.
Remote configuration fails explicitly
Once retry behavior is policy-driven, runtime policy updates become the obvious next step.
recourse supports a controlplane.RemoteProvider that fetches policies from an external source and caches them:
- TTL caching for fetched policies
- negative caching for not-found policies
- explicit fallback behavior through MissingPolicyMode when the source is unavailable
A control plane outage should not turn into mysterious retry behavior. Remote policy resolution has to fail visibly and deliberately.
Integrations should be useful without becoming a framework
Go libraries should not force a framework on users just to make a service call.
recourse keeps the root module light and pushes heavier dependencies into separate integration modules.
The integration philosophy is:
- target standard interfaces such as net/http and gRPC interceptors
- keep optional dependencies optional
- handle correctness details that are easy to get wrong at every call site
- preserve observability through the same policy/timeline model
For example, the HTTP integration wraps transport errors and non-2xx responses as StatusError so HTTP-aware classifiers can interpret status codes and Retry-After. It also drains and closes failed response bodies so connections can be reused.
The gRPC and OpenTelemetry integrations live in separate modules so users can opt into their dependency surface deliberately.
Adoption should be incremental
You do not need a rewrite to use recourse.
A reasonable adoption path is:
- Start with one idempotent or otherwise retry-safe call site.
- Give it a stable low-cardinality key.
- Use the facade API with the default policy to observe behavior.
- Capture a timeline in tests or staging.
- Move to explicit executors and policies when the call site needs standardized behavior.
- Add budgets before scaling retries or hedging broadly.
- Introduce provider-backed policy only after the governance model is clear.
The call sites stay boring. The policy envelope becomes the place where reliability decisions live.
Closing thought
Retries are not a feature. They are an operational commitment.
recourse tries to make that commitment explicit: policy-keyed control, bounded envelopes, protocol-aware classification, backpressure, hedging, circuit breaking, and observability that tells the truth about what happened.
That is the model I want feedback on from Go teams: not whether another retry helper should exist, but whether retry behavior deserves to be governed as policy once it crosses service boundaries.
For a decision-focused intro, start with Design overview. Then see Getting started, Adoption guide, Gotchas, Incident debugging, and Migration from cenkalti/backoff.