Observability contract
Hook signatures
- Metrics: on_metric(event: str, attempt: int, sleep_s: float, tags: Dict[str, Any])
- Logging: on_log(event: str, fields: Dict[str, Any]) (fields include attempt, sleep_s, and tags)
Hook failures are swallowed so they never break the workload; log adapter errors separately if needed.
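The same swallowing behavior is easy to reproduce in your own adapters. A minimal sketch, assuming nothing from redress itself (`safe_hook` and the `logging` fallback are illustrative names, not library APIs):

```python
import logging

def safe_hook(hook):
    """Wrap a metric/log hook so its failures never reach the workload."""
    def wrapped(*args, **kwargs):
        try:
            hook(*args, **kwargs)
        except Exception:
            # Report the adapter error out-of-band instead of raising.
            logging.getLogger("hooks").exception("observability hook failed")
    return wrapped

def flaky_metric(event, attempt, sleep_s, tags):
    raise RuntimeError("metrics backend down")

safe = safe_hook(flaky_metric)
safe("retry", 1, 0.5, {"operation": "sync_user"})  # logged, not raised
```

The wrapper trades silent metric loss for workload safety, which matches the contract above.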
Events
Event names are exported as redress.events.EventName (values shown below).
- success – call succeeded
- retry – retry scheduled (includes sleep_s)
- permanent_fail – non-retriable class (PERMANENT, AUTH, PERMISSION)
- deadline_exceeded – wall-clock deadline exceeded
- max_attempts_exceeded – global or per-class cap reached
- max_unknown_attempts_exceeded – UNKNOWN-specific cap reached
- no_strategy_configured – missing strategy for a retryable class
- budget_exhausted – retry budget depleted
- scheduled – retry deferred by a sleep handler
- aborted – retry aborted via abort_if or AbortRetryError
- circuit_opened – circuit breaker transitions to open
- circuit_half_open – circuit breaker transitions to half-open
- circuit_closed – circuit breaker transitions to closed
- circuit_rejected – breaker rejected a call
Attempts are 1-based. sleep_s is the scheduled delay for retries, otherwise 0.0.
Breaker events use attempt=0 and sleep_s=0.0.
Abort events use the number of completed attempts (0 if aborted before the first).
For result-driven failures, err is omitted and cause="result" is included.
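Event names are mainly useful for routing. A sketch of a log hook that routes terminal events to a higher severity; the TERMINAL set is assembled by hand from the list above, not imported from redress:

```python
# Terminal event names, transcribed from the event list (an assumption,
# not a redress export).
TERMINAL = {
    "permanent_fail", "deadline_exceeded", "max_attempts_exceeded",
    "max_unknown_attempts_exceeded", "no_strategy_configured",
    "budget_exhausted", "aborted",
}

records = []

def routing_log_hook(event, fields):
    # Terminal events get error severity; everything else is informational.
    level = "error" if event in TERMINAL else "info"
    records.append((level, event, fields))

routing_log_hook("retry", {"attempt": 1, "sleep_s": 0.5, "tags": {}})
routing_log_hook("budget_exhausted", {"attempt": 4, "sleep_s": 0.0, "tags": {}})
```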
Stop reasons (terminal only)
Terminal events carry a stable stop_reason tag drawn from a small, fixed set:
- MAX_ATTEMPTS_GLOBAL
- MAX_ATTEMPTS_PER_CLASS
- DEADLINE_EXCEEDED
- MAX_UNKNOWN_ATTEMPTS
- NON_RETRYABLE_CLASS
- NO_STRATEGY
- BUDGET_EXHAUSTED
- SCHEDULED
- ABORTED
ABORTED is emitted when abort_if requests an abort or a sleep handler returns SleepDecision.ABORT.
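Because the set is fixed, a metric hook can aggregate terminal outcomes by stop_reason directly. A hypothetical sketch (`count_stops` is not a redress API; the calls at the bottom simulate what redress would emit):

```python
from collections import Counter

stop_reasons = Counter()

def count_stops(event, attempt, sleep_s, tags):
    # Only terminal events carry a stop_reason tag.
    reason = tags.get("stop_reason")
    if reason is not None:
        stop_reasons[reason] += 1

# Simulated emissions, shaped like the hook signature above:
count_stops("retry", 1, 0.5, {"class": "RATE_LIMIT"})
count_stops("max_attempts_exceeded", 3, 0.0, {"stop_reason": "MAX_ATTEMPTS_GLOBAL"})
count_stops("aborted", 2, 0.0, {"stop_reason": "ABORTED"})
```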
Tags
- operation – optional logical name provided by the caller
- class – ErrorClass.name when available
- err – exception class name when available
- stop_reason – terminal reason, on stop events only
- cause – "exception" or "result" when a failure triggers retries/stops
- state – circuit breaker state (closed, open, half_open) on breaker events
Avoid payloads or sensitive fields in tags; stick to identifiers.
Prometheus pattern
from redress.metrics import prometheus_metric_hook
policy.call(
    lambda: do_work(),
    on_metric=prometheus_metric_hook(counter),
    operation="sync_user",
)
The counter must expose .labels(event=..., **tags).inc().
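Any object with that shape works, which makes the hook easy to test without a Prometheus client. A purely illustrative in-memory stand-in (`FakeCounter` is an assumption, not part of redress or prometheus_client):

```python
class FakeCounter:
    """Duck-typed stand-in for a Prometheus counter: .labels(**tags).inc()."""
    def __init__(self):
        self.samples = {}

    def labels(self, **tags):
        key = tuple(sorted(tags.items()))
        counter = self

        class _Child:
            def inc(self, amount=1.0):
                counter.samples[key] = counter.samples.get(key, 0.0) + amount

        return _Child()

fake = FakeCounter()
fake.labels(event="retry", operation="sync_user").inc()
```

Passing `fake` where the real counter goes lets tests assert on `fake.samples` directly.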
Prometheus contrib hook
from prometheus_client import Counter, Histogram
from redress.contrib.prometheus import prometheus_hooks
event_counter = Counter(
    "redress_events_total",
    "redress retry lifecycle events",
    ["event", "class", "operation"],
)
retry_sleep = Histogram(
    "redress_retry_sleep_seconds",
    "scheduled retry delays",
    ["class", "operation"],
)
hooks = prometheus_hooks(
    events=event_counter,
    retry_sleep_seconds=retry_sleep,
)
policy.call(lambda: do_work(), **hooks, operation="sync_user")
prometheus_hooks increments an event counter for every lifecycle event and,
when a histogram is provided, observes retry sleep durations for retry events.
Datadog contrib hook
from datadog import statsd
from redress.contrib.datadog import datadog_hooks
hooks = datadog_hooks(
    statsd=statsd,
    prefix="svc.redress",
    constant_tags=["env:prod"],
)
policy.call(lambda: do_work(), **hooks, operation="sync_user")
datadog_hooks emits:
- {prefix}.events with tags like event:retry, class:RATE_LIMIT, operation:sync_user
- {prefix}.retry.sleep_seconds histograms for retry delays
Sentry contrib hook
import sentry_sdk
from redress.contrib.sentry import sentry_hooks
sentry_sdk.init(dsn="...")
hooks = sentry_hooks(sentry=sentry_sdk)
policy.call(lambda: do_work(), **hooks, operation="sync_user")
sentry_hooks turns hook log events into breadcrumbs and captures terminal
failure events as Sentry messages by default.
OpenTelemetry hooks
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.trace import TracerProvider
from redress.contrib.otel import otel_hooks
trace.set_tracer_provider(TracerProvider())
metrics.set_meter_provider(MeterProvider())
hooks = otel_hooks(
    tracer=trace.get_tracer("redress"),
    meter=metrics.get_meter("redress"),
)
policy.call(
    lambda: do_work(),
    **hooks,
    operation="sync_user",
)
otel_hooks emits spans with attempt events plus metrics:
redress.retries, redress.retry.duration, redress.retry.success_after_retries,
and redress.circuit.state. Attributes include error.class, retry.attempt,
and operation. It requires opentelemetry-api (and the SDK if you set
providers directly).
Testing hooks
Quick pattern to assert hooks fire without needing real backends:
events = []
def metric(event, attempt, sleep_s, tags):
    events.append((event, attempt, sleep_s, tags))
policy.call(work, on_metric=metric, operation="op")
assert any(e[0] == "retry" for e in events)
You can use the same shape for log hooks; ensure tests avoid networked backends and use local spies instead.
Tag cardinality guidance
- Keep tags low-cardinality (class, operation, err); avoid per-user/request IDs.
- For HTTP, prefer status classes (e.g., map 5xx) via http_classifier instead of embedding URLs.
- For DB, map SQLSTATE classes via sqlstate_classifier and avoid query text.
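One way to enforce this guidance mechanically is an allowlist wrapper around a metric hook. A minimal sketch; ALLOWED_TAGS and filtered_metric_hook are hypothetical names (the allowlist mirrors the Tags section above), not redress APIs:

```python
# Allowlist transcribed from the Tags section (an assumption, not imported).
ALLOWED_TAGS = {"operation", "class", "err", "stop_reason", "cause", "state"}

def filtered_metric_hook(inner):
    """Drop any tag outside the low-cardinality allowlist before forwarding."""
    def hook(event, attempt, sleep_s, tags):
        inner(event, attempt, sleep_s,
              {k: v for k, v in tags.items() if k in ALLOWED_TAGS})
    return hook

seen = []
hook = filtered_metric_hook(lambda e, a, s, t: seen.append(t))
hook("retry", 1, 0.5, {"class": "SERVER_ERROR", "user_id": "u-123", "url": "/x"})
```

Wrapping at the hook boundary keeps cardinality control in one place, regardless of which backend adapter sits behind it.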
Structured logging example
import structlog
from redress import retry
logger = structlog.get_logger()
def log_hook(event: str, fields: dict[str, object]) -> None:
    logger.info("retry_event", event=event, **fields)
@retry(on_log=log_hook, operation="sync_account")
def do_work():
    ...
OpenTelemetry metric-only hook
from redress.metrics import otel_metric_hook
meter = ... # your OTEL meter
metric_hook = otel_metric_hook(meter, name="redress_events")
@retry(on_metric=metric_hook, operation="fetch_user")
def do_work():
    ...
Prometheus exporter sample
from prometheus_client import Counter, start_http_server
from redress.metrics import prometheus_metric_hook
from redress import retry
counter = Counter("redress_events", "Retry events", ["event", "class", "operation", "err"])
metric_hook = prometheus_metric_hook(counter)
start_http_server(8000)
@retry(on_metric=metric_hook, operation="sync_user")
def do_work():
    ...
Alerting ideas
- Rising retry or max_attempts_exceeded for RATE_LIMIT/SERVER_ERROR -> backoff/circuit breaker tuning.
- Frequent permanent_fail with AUTH/PERMISSION -> credential/config issues.
- deadline_exceeded spikes -> deadline too low or upstream slowness.