recourse
recourse (n.): a source of help or strength.
Policy-driven, observable resilience for Go services: retries, hedging, circuit breaking, and budgets.
New here? Start with Design overview, then move to Getting started.
Why recourse?
Retry logic is deceptively easy to write and notoriously hard to operate.
- Resilience code drifts: each service ends up with slightly different retry semantics, timeouts, logging, and metrics.
- Naive retries amplify outages: retries turn “a little latency” into “a lot more load” unless there is explicit backpressure.
- Protocol/domain semantics matter: a timeout, a 429, and a 404 should not all be treated the same.
- Debuggability is non-negotiable: when an incident happens, you need to answer “what happened on each attempt, and why?”
recourse centralizes resilience behavior behind a low-cardinality policy key and makes every decision observable.
Concretely, recourse gives you:
- Deterministic envelopes: bounded attempts, bounded backoff, and explicit timeouts.
- Domain-aware retry decisions: pluggable classifiers (instead of “retry on any error”).
- Backpressure: per-attempt budgets to prevent retry storms.
- Structured observability: timelines and hooks that make behavior explainable in production.
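To make "domain-aware retry decisions" concrete, here is a minimal sketch of a classifier for HTTP-style results. The Outcome type, its values, and classifyHTTP are hypothetical names chosen for illustration; they are not recourse's actual classifier API, and treating every 5xx as retryable is a simplification.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// Outcome is a hypothetical classification result; recourse's own type may differ.
type Outcome string

const (
	Success      Outcome = "success"
	Retryable    Outcome = "retryable"
	NonRetryable Outcome = "non-retryable"
	Abort        Outcome = "abort"
)

// classifyHTTP maps a status code and transport error to an Outcome.
// It is an illustrative sketch, not part of recourse.
func classifyHTTP(status int, err error) Outcome {
	switch {
	case errors.Is(err, context.Canceled):
		return Abort // the caller gave up; retrying would be wasted work
	case err != nil:
		return Retryable // timeouts and transport errors (assumes an idempotent call)
	case status == http.StatusTooManyRequests:
		return Retryable // a real client would also honor Retry-After
	case status >= 500:
		return Retryable // simplification: not every 5xx is worth retrying
	case status >= 400:
		return NonRetryable // e.g. 404: repeating the request will not help
	default:
		return Success
	}
}

func main() {
	fmt.Println(classifyHTTP(http.StatusNotFound, nil))
	fmt.Println(classifyHTTP(http.StatusTooManyRequests, nil))
	fmt.Println(classifyHTTP(http.StatusBadGateway, nil))
	fmt.Println(classifyHTTP(0, context.DeadlineExceeded))
	fmt.Println(classifyHTTP(0, context.Canceled))
}
```

The point of the sketch is that the retry decision lives in one function that can be tested on its own, rather than being scattered across call sites.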
The problem with ad-hoc retries
This is a common shape:
var lastErr error
for attempt := 0; attempt < 3; attempt++ {
	if err := callDependency(ctx); err == nil {
		return nil
	} else {
		lastErr = err
	}
	time.Sleep(time.Duration(attempt+1) * 50 * time.Millisecond)
}
return lastErr
But production requirements pile up quickly:
- Is the error retryable (HTTP 500 vs 404 vs 429)? Should we treat context.Canceled differently?
- Are we respecting per-attempt timeouts vs overall timeouts?
- Are we emitting consistent logs/metrics/traces across all call sites?
- Do we have backpressure so retries don’t turn an outage into a storm?
- When this goes wrong at 3am, can we reconstruct the exact sequence of attempts and decisions?
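As a rough illustration of how much the naive loop grows once a few of these questions are answered, the sketch below adds per-attempt timeouts, a stop condition for non-retryable errors, and a backoff that respects context cancellation. It uses only the standard library; retry, runDemo, errTransient, and errPermanent are hypothetical names, and this is not how recourse itself is implemented.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors standing in for a real classifier.
var (
	errTransient = errors.New("transient failure")
	errPermanent = errors.New("permanent failure")
)

// retry calls op up to maxAttempts times and returns the number of attempts made.
// Each attempt gets its own timeout derived from the caller's context.
func retry(ctx context.Context, maxAttempts int, perAttempt, baseDelay time.Duration, op func(context.Context) error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		lastErr = op(attemptCtx)
		cancel()
		if lastErr == nil {
			return attempt, nil
		}
		// Stop on permanent errors, or when the caller's own context is done.
		if errors.Is(lastErr, errPermanent) || ctx.Err() != nil {
			return attempt, lastErr
		}
		if attempt == maxAttempts {
			break // no point sleeping after the final attempt
		}
		// Linear backoff that wakes up early if the caller cancels.
		timer := time.NewTimer(time.Duration(attempt) * baseDelay)
		select {
		case <-ctx.Done():
			timer.Stop()
			return attempt, ctx.Err()
		case <-timer.C:
		}
	}
	return maxAttempts, lastErr
}

// runDemo runs an operation that fails `failures` times with failWith, then succeeds.
func runDemo(failures int, failWith error) (int, error) {
	calls := 0
	op := func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return failWith
		}
		return nil
	}
	return retry(context.Background(), 3, 100*time.Millisecond, time.Millisecond, op)
}

func main() {
	fmt.Println(runDemo(2, errTransient)) // succeeds on the third attempt
	fmt.Println(runDemo(5, errTransient)) // exhausts all three attempts
	fmt.Println(runDemo(5, errPermanent)) // stops after the first attempt
}
```

Even this version still lacks jitter, budgets, and any record of what happened on each attempt, which is the gap the rest of this page describes.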
What “policy-driven” means
In recourse, call sites supply a key (e.g., "payments.Charge"). Policies decide the retry envelope for that key:
- maximum attempts
- backoff/jitter
- per-attempt and overall timeouts
- classifier selection (how to interpret errors/results)
- optional budgets/backpressure (whether to allow each attempt)
This makes behavior consistent, tunable, and observable without re-implementing retry loops everywhere.
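The idea of a key selecting an envelope can be sketched with a plain lookup table. Everything below is hypothetical: the Policy struct, its field names, the values, and policyFor are illustrative only and do not reflect recourse's policy schema or provider interfaces.

```go
package main

import (
	"fmt"
	"time"
)

// Policy is a hypothetical retry envelope; field names are illustrative.
type Policy struct {
	MaxAttempts    int
	InitialBackoff time.Duration
	MaxBackoff     time.Duration
	AttemptTimeout time.Duration
	OverallTimeout time.Duration
}

// defaultPolicy applies to keys without an explicit entry (assumed values).
var defaultPolicy = Policy{
	MaxAttempts:    3,
	InitialBackoff: 20 * time.Millisecond,
	MaxBackoff:     500 * time.Millisecond,
	AttemptTimeout: time.Second,
	OverallTimeout: 5 * time.Second,
}

// policies maps low-cardinality keys to envelopes (assumed values).
var policies = map[string]Policy{
	"payments.Charge": {
		MaxAttempts:    2,
		InitialBackoff: 100 * time.Millisecond,
		MaxBackoff:     time.Second,
		AttemptTimeout: 2 * time.Second,
		OverallTimeout: 5 * time.Second,
	},
	"user-service.GetUser": {
		MaxAttempts:    4,
		InitialBackoff: 50 * time.Millisecond,
		MaxBackoff:     200 * time.Millisecond,
		AttemptTimeout: 300 * time.Millisecond,
		OverallTimeout: 2 * time.Second,
	},
}

// policyFor resolves a key to its policy, falling back to the default.
func policyFor(key string) Policy {
	if p, ok := policies[key]; ok {
		return p
	}
	return defaultPolicy
}

// backoff returns the capped exponential delay after the given 1-based attempt.
func (p Policy) backoff(attempt int) time.Duration {
	d := p.InitialBackoff
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= p.MaxBackoff {
			return p.MaxBackoff
		}
	}
	if d > p.MaxBackoff {
		return p.MaxBackoff
	}
	return d
}

func main() {
	p := policyFor("user-service.GetUser")
	for attempt := 1; attempt <= p.MaxAttempts; attempt++ {
		fmt.Printf("attempt %d: backoff %v\n", attempt, p.backoff(attempt))
	}
	fmt.Println(policyFor("unknown.Key") == defaultPolicy)
}
```

Because call sites only name a key, changing the envelope for "payments.Charge" in this sketch means editing one table entry rather than every loop that calls the payments service.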
Quick start
The facade API takes a string key like "user-service.GetUser":
package main

import (
	"context"

	"github.com/aponysus/recourse/recourse"
)

type User struct{ ID string }

func main() {
	user, err := recourse.DoValue[User](context.Background(), "user-service.GetUser", func(ctx context.Context) (User, error) {
		// call dependency here
		return User{ID: "123"}, nil
	})
	_ = user
	_ = err
}
When you need to know what happened, request a timeline:
ctx, capture := observe.RecordTimeline(ctx)
user, err := recourse.DoValue(ctx, "user-service.GetUser", op)
_ = user
_ = err
tl := capture.Timeline()
for _, a := range tl.Attempts {
	// a.Attempt, a.Outcome, a.BudgetAllowed, a.Backoff, a.Err, ...
}
Observability-first
Retries are only “safe” if they are observable.
recourse captures a structured observe.Timeline (attempt timings, outcomes, budget decisions, errors) and can also stream attempt/timeline events to your own logging/metrics/tracing via observe.Observer.
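The general pattern of recording one structured event per attempt can be sketched as follows. The attemptEvent struct, the observer interface, the timeline collector, and runObserved are all hypothetical stand-ins written for this example; they are not the types or signatures of the observe package.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// attemptEvent is a hypothetical per-attempt record, not recourse's type.
type attemptEvent struct {
	Key      string
	Attempt  int
	Duration time.Duration
	Err      error
}

// observer is a hypothetical hook that receives one event per attempt.
type observer interface {
	OnAttempt(ev attemptEvent)
}

// timeline is an observer that keeps events in the order they occurred.
type timeline struct {
	events []attemptEvent
}

func (t *timeline) OnAttempt(ev attemptEvent) {
	t.events = append(t.events, ev)
}

// runObserved calls op up to maxAttempts times and reports every attempt to obs.
func runObserved(key string, maxAttempts int, obs observer, op func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		start := time.Now()
		err = op()
		obs.OnAttempt(attemptEvent{Key: key, Attempt: attempt, Duration: time.Since(start), Err: err})
		if err == nil {
			return nil
		}
	}
	return err
}

// demoTimeline records an operation that fails `failures` times, then succeeds.
func demoTimeline(failures int) *timeline {
	tl := &timeline{}
	calls := 0
	_ = runObserved("user-service.GetUser", 3, tl, func() error {
		calls++
		if calls <= failures {
			return errors.New("upstream unavailable")
		}
		return nil
	})
	return tl
}

func main() {
	for _, ev := range demoTimeline(2).events {
		fmt.Printf("key=%s attempt=%d err=%v\n", ev.Key, ev.Attempt, ev.Err)
	}
}
```

The same hook could forward events to a logger or a metrics counter instead of collecting them in memory; the sketch keeps them in a slice so the sequence is easy to inspect.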
What’s inside
- Policy keys: stable, low-cardinality keys ("svc.Method") that select behavior.
- Policies + providers: policy.EffectivePolicy resolved via controlplane.PolicyProvider (today: in-process static).
- Retry executor: bounded attempts with backoff/jitter and per-attempt/overall timeouts.
- Classifiers: pluggable (value, err) → Outcome so retry decisions are protocol/domain-aware.
- Budgets/backpressure: per-attempt gates to prevent retry storms (with optional release semantics).
- Observability: structured observe.Timeline plus streaming hooks via observe.Observer.
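One way to picture a per-attempt gate with release semantics is a small in-flight limiter: an attempt must acquire a slot before it runs and hands the slot back when it finishes. The retryBudget type and its allow method below are hypothetical and only illustrate the idea; recourse's budget interface may look quite different.

```go
package main

import (
	"fmt"
	"sync"
)

// retryBudget is a hypothetical gate that caps attempts in flight at once.
type retryBudget struct {
	mu       sync.Mutex
	inFlight int
	limit    int
}

// allow reports whether an attempt may proceed. On success it returns a
// release function that frees the slot; calling release more than once is safe.
func (b *retryBudget) allow() (release func(), ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.inFlight >= b.limit {
		return func() {}, false // denied: nothing to release
	}
	b.inFlight++
	var once sync.Once
	return func() {
		once.Do(func() {
			b.mu.Lock()
			b.inFlight--
			b.mu.Unlock()
		})
	}, true
}

func main() {
	budget := &retryBudget{limit: 2}

	release1, ok1 := budget.allow()
	_, ok2 := budget.allow()
	_, ok3 := budget.allow() // over the limit, so this attempt is denied
	fmt.Println(ok1, ok2, ok3)

	release1() // the first attempt finished and returns its slot
	_, ok4 := budget.allow()
	fmt.Println(ok4)
}
```

When the gate denies an attempt, the caller fails fast instead of adding load to a dependency that is already struggling, which is what keeps retries from amplifying an outage.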
Where to go next
- Design overview – decision-first intro and tradeoffs.
- Getting started – install and first examples.
- Gotchas & safety checklist – avoid common operational failures.
- Adoption guide – staged rollout plan.
- Incident debugging – timeline-based runbook.
- API compatibility policy – v1 stability contract.
- Policy schema reference – generated field reference.
- Reason codes & timeline fields – generated reference.
- Changelog – release history.
- Concepts:
  - Policy keys
  - Policies & providers
  - Classifiers
  - Observability
  - Budgets & backpressure
  - Hedging
  - Circuit Breaking
  - Remote Configuration
  - Integrations
- Architecture decisions:
  - ADR 001: Low-cardinality policy keys
  - ADR 003: Policy normalization
- Extending – write custom classifiers/budgets/observers.