recourse¶

recourse (n.): a source of help or strength.

Policy-driven, observable resilience for Go services: retries, hedging, circuit breaking, and budgets.

New here? Start with Design overview, then move to Getting started.

Why `recourse`?¶

Retry logic is deceptively easy to write and notoriously hard to operate.

Resilience code drifts: each service ends up with slightly different retry semantics, timeouts, logging, and metrics.
Naive retries amplify outages: retries turn “a little latency” into “a lot more load” unless there is explicit backpressure.
Protocol/domain semantics matter: a timeout, a 429, and a 404 should not all be treated the same.
Debuggability is non-negotiable: when an incident happens, you need to answer “what happened on each attempt, and why?”.

recourse centralizes resilience behavior behind a low-cardinality policy key and makes every decision observable.

Concretely, recourse gives you:

Deterministic envelopes: bounded attempts, bounded backoff, and explicit timeouts.
Domain-aware retry decisions: pluggable classifiers (instead of “retry on any error”).
Backpressure: per-attempt budgets to prevent retry storms.
Structured observability: timelines and hooks that make behavior explainable in production.

The problem with ad-hoc retries¶

This is a common shape:

var lastErr error
for attempt := 0; attempt < 3; attempt++ {
    if err := callDependency(ctx); err == nil {
        return nil
    } else {
        lastErr = err
    }
    time.Sleep(time.Duration(attempt+1) * 50 * time.Millisecond)
}
return lastErr

But production requirements pile up quickly:

Is the error retryable (HTTP 500 vs 404 vs 429)? Should we treat context.Canceled differently?
Are we enforcing both a per-attempt timeout and an overall timeout?
Are we emitting consistent logs/metrics/traces across all call sites?
Do we have backpressure so retries don’t turn an outage into a storm?
When this goes wrong at 3am, can we reconstruct the exact sequence of attempts and decisions?

What “policy-driven” means¶

In recourse, call sites supply a key (e.g., "payments.Charge"). Policies decide the retry envelope for that key:

maximum attempts
backoff/jitter
per-attempt and overall timeouts
classifier selection (how to interpret errors/results)
optional budgets/backpressure (whether to allow each attempt)

This makes behavior consistent, tunable, and observable without re-implementing retry loops everywhere.

Quick start¶

The facade API takes a string key like "user-service.GetUser":

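package main

import (
    "context"
    "fmt"
    "log"

    "github.com/aponysus/recourse/recourse"
)

type User struct{ ID string }

func main() {
    // The key selects the policy that governs this call site.
    user, err := recourse.DoValue[User](context.Background(), "user-service.GetUser", func(ctx context.Context) (User, error) {
        // Call the dependency here, passing ctx so timeouts and cancellation propagate.
        return User{ID: "123"}, nil
    })
    if err != nil {
        log.Fatalf("GetUser failed: %v", err)
    }
    fmt.Println("user:", user.ID)
}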

When you need to know what happened, request a timeline:

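// op is the func(ctx context.Context) (User, error) from the example above.
// Attach a timeline recorder to the context before making the call.
ctx, capture := observe.RecordTimeline(ctx)
user, err := recourse.DoValue(ctx, "user-service.GetUser", op)
_ = user
_ = err

// After the call returns, inspect what happened on each attempt.
tl := capture.Timeline()
for _, a := range tl.Attempts {
    // Other fields include a.BudgetAllowed, a.Backoff, ...
    fmt.Println(a.Attempt, a.Outcome, a.Err)
}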

Observability-first¶

Retries are only “safe” if they are observable.

recourse captures a structured observe.Timeline (attempt timings, outcomes, budget decisions, errors) and can also stream attempt/timeline events to your own logging/metrics/tracing via observe.Observer.

What’s inside¶

Policy keys: stable, low-cardinality keys ("svc.Method") that select behavior.
Policies + providers: policy.EffectivePolicy resolved via controlplane.PolicyProvider (currently an in-process static provider).
Retry executor: bounded attempts with backoff/jitter and per-attempt/overall timeouts.
Classifiers: pluggable (value, err) → Outcome so retry decisions are protocol/domain-aware.
Budgets/backpressure: per-attempt gates to prevent retry storms (with optional release semantics).
Observability: structured observe.Timeline plus streaming hooks via observe.Observer.
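
To illustrate the budget idea from the list above, here is a minimal sketch of a per-attempt gate: first attempts always pass and earn a fraction of a token, while each retry must spend a whole token. The `Budget` type, `NewBudget`, and `Allow` are hypothetical names for this sketch; recourse's budget interface (including its optional release semantics) is not modeled here.

```go
package main

import (
    "fmt"
    "sync"
)

// Budget is a hypothetical retry budget, not recourse's budget type.
// It limits retries to roughly `ratio` retries per first attempt,
// with a burst allowance of `max` tokens.
type Budget struct {
    mu     sync.Mutex
    tokens float64
    max    float64
    ratio  float64
}

// NewBudget returns a budget that starts full.
func NewBudget(max, ratio float64) *Budget {
    return &Budget{tokens: max, max: max, ratio: ratio}
}

// Allow reports whether an attempt may proceed.
// First attempts always pass and earn credit; retries spend one token.
func (b *Budget) Allow(isRetry bool) bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    if !isRetry {
        b.tokens += b.ratio
        if b.tokens > b.max {
            b.tokens = b.max
        }
        return true
    }
    if b.tokens < 1 {
        return false // budget exhausted: shed the retry
    }
    b.tokens--
    return true
}

func main() {
    b := NewBudget(2, 0.5)
    fmt.Println(b.Allow(true), b.Allow(true), b.Allow(true))
}
```

During an outage nearly every call fails, so the tokens drain quickly and further retries are refused. That caps the extra load retries can add at roughly `ratio` times normal traffic once the initial burst is spent.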

Where to go next¶

Design overview – decision-first intro and tradeoffs.
Getting started – install and first examples.
Gotchas & safety checklist – avoid common operational failures.
Adoption guide – staged rollout plan.
Incident debugging – timeline-based runbook.
API compatibility policy – v1 stability contract.
Defaults and safety model – generated defaults and failure modes.
Policy schema reference – generated field reference.
Reason codes & timeline fields – generated reference.
Changelog – release history.
Concepts:
    Policy keys
    Key patterns and taxonomy
    Policies & providers
    Classifiers
    Observability
    Budgets & backpressure
    Hedging
    Circuit breaking
    Remote configuration
    Integrations
Architecture decisions:
    ADR 001: Low-cardinality policy keys
    ADR 003: Policy normalization
Extending – write custom classifiers/budgets/observers.