Incident debugging with timelines
This is a practical runbook for understanding what recourse did during a call. The goal is to answer two questions: what happened on each attempt, and why did the call stop?
Capture a timeline
Use observe.RecordTimeline at the call site you want to inspect:
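    package main

    import (
        "context"

        "github.com/aponysus/recourse/observe"
        "github.com/aponysus/recourse/recourse"
    )

    func main() {
        // Attach a timeline capture to the context used for the call.
        ctx, capture := observe.RecordTimeline(context.Background())

        _ = recourse.Do(ctx, "user-service.GetUser", func(ctx context.Context) error {
            // Replace with the real operation.
            return nil
        })

        // Read the recorded timeline once the call has returned.
        tl := capture.Timeline()
        _ = tl
    }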
If you need streaming events for logs or metrics, implement observe.Observer and attach it in the executor options. See Observability.
Triage checklist
Start with the basics:
- How many attempts ran? Check len(tl.Attempts).
- What was the final error? Inspect tl.FinalErr and compare it to the last attempt error.
- Why did each attempt stop? Inspect AttemptRecord.Outcome.Kind and AttemptRecord.Outcome.Reason.
Then dig into common failure modes:
- Backoff and timing: Compare AttemptRecord.Backoff to the time between attempts.
- Budgets: Check AttemptRecord.BudgetAllowed and AttemptRecord.BudgetReason. If you use an observer, the BudgetDecisionEvent will include the mode and reason.
- Hedging: Look for AttemptRecord.IsHedge and AttemptRecord.HedgeIndex to see which attempts were hedges.
- Circuit breaking: Inspect AttemptRecord.Err and AttemptRecord.Outcome.Reason for signals that the circuit short-circuited the call.
- Policy resolution: Inspect tl.Attributes for provider and normalization metadata when present.
Questions you should be able to answer
- Did the call retry, or stop after the first attempt?
- Was the stop due to classification, budgets, circuit state, or timeouts?
- Did hedging improve latency or just add load?
- Were any attempts denied before they ran?
Suggested fields to log
For logs and metrics, capture a small, low-cardinality set of fields:
- key (policy key)
- attempt
- outcome.kind and outcome.reason
- backoff_ms
- budget_allowed and budget_reason
- is_hedge
- duration_ms
Keep labels low-cardinality and avoid embedding IDs in keys or attributes.