Observability contract
Hook signatures
- Metrics:
on_metric(event: str, attempt: int, sleep_s: float, tags: Dict[str, Any]) - Logging:
on_log(event: str, fields: Dict[str, Any])(fields include attempt, sleep_s, tags)
Hook failures are swallowed so they never break the workload; log adapter errors separately if needed.
Events
success– call succeededretry– retry scheduled (includessleep_s)permanent_fail– non-retriable class (PERMANENT,AUTH,PERMISSION)deadline_exceeded– wall-clock deadline exceededmax_attempts_exceeded– global or per-class cap reachedmax_unknown_attempts_exceeded– UNKNOWN-specific cap reached
Attempts are 1-based. sleep_s is the scheduled delay for retries, otherwise 0.0.
Tags
operation– optional logical name provided by callerclass–ErrorClass.namewhen availableerr– exception class name when available
Avoid payloads or sensitive fields in tags; stick to identifiers.
Prometheus pattern
from redress.metrics import prometheus_metric_hook
policy.call(
lambda: do_work(),
on_metric=prometheus_metric_hook(counter),
operation="sync_user",
)
Counter should expose .labels(event=..., **tags).inc().
OTEL pattern (pseudo-code)
from redress.metrics import otel_metric_hook
policy.call(
lambda: do_work(),
on_metric=otel_metric_hook(meter, name="redress_attempts"),
operation="sync_user",
)
Meter counter should support .add(1, attributes=attributes).
Testing hooks
Quick pattern to assert hooks fire without needing real backends:
events = []
def metric(event, attempt, sleep_s, tags):
events.append((event, attempt, sleep_s, tags))
policy.call(work, on_metric=metric, operation="op")
assert any(e[0] == "retry" for e in events)
You can use the same shape for log hooks; ensure tests avoid networked backends and use local spies instead.
Tag cardinality guidance
- Keep tags low-cardinality (
class,operation,err); avoid per-user/request IDs. - For HTTP, prefer status classes (e.g., map 5xx) via
http_classifierinstead of embedding URLs. - For DB, map SQLSTATE classes via
sqlstate_classifierand avoid query text.
Structured logging example
import structlog
from redress import retry
logger = structlog.get_logger()
def log_hook(event: str, fields: dict[str, object]) -> None:
logger.info("retry_event", event=event, **fields)
@retry(on_log=log_hook, operation="sync_account")
def do_work():
...
OTEL counter example (pseudo-code)
from redress.metrics import otel_metric_hook
meter = ... # your OTEL meter
metric_hook = otel_metric_hook(meter, name="redress_events")
@retry(on_metric=metric_hook, operation="fetch_user")
def do_work():
...
Prometheus exporter sample
from prometheus_client import Counter, start_http_server
from redress.metrics import prometheus_metric_hook
from redress import retry
counter = Counter("redress_events", "Retry events", ["event", "class", "operation", "err"])
metric_hook = prometheus_metric_hook(counter)
start_http_server(8000)
@retry(on_metric=metric_hook, operation="sync_user")
def do_work():
...
Alerting ideas
- Rising
retryormax_attempts_exceededforRATE_LIMIT/SERVER_ERROR-> backoff/circuit breaker tuning. - Frequent
permanent_failwithAUTH/PERMISSION-> credential/config issues. deadline_exceededspikes -> deadline too low or upstream slowness.