The shape of the problem
A saga is a sequence of steps where each step has a compensating action. If any step fails, you run the compensations for the steps that already succeeded — in reverse order.
For a payment disbursement:
1. Reserve funds in source account → Compensation: release reservation
2. Create transaction record → Compensation: mark void
3. Call payment gateway → Compensation: refund (if it went through)
4. Update borrower's balance → Compensation: revert balance
5. Send notification → Compensation: send "cancelled" notification
Happy path: 1→2→3→4→5. Failure at step 4: compensate 3, compensate 2, compensate 1 (in that order).
The order of compensations matters
If steps 1, 2, and 3 succeeded in that order, the compensations run 3, then 2, then 1. Any other order can leave the system in an inconsistent state.
Why: compensating step 1 (release reservation) before step 3 (refund) might trigger a balance check that fails. Compensating step 2 before step 3 might leave the gateway call orphaned without a transaction record to associate the refund with.
The Go implementation
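import (
	"context"
	"fmt"
	"log/slog" // Go 1.21+
)

// Step pairs a forward action with the compensation that undoes it.
type Step struct {
	Forward    func(ctx context.Context, state *State) error
	Compensate func(ctx context.Context, state *State) error
	Name       string
}

// Saga runs its steps in order and compensates completed steps on failure.
type Saga struct {
	Steps []Step
}

func (s *Saga) Run(ctx context.Context, state *State) error {
	var completed []int // indexes of steps whose Forward succeeded

	defer func() {
		if len(completed) == len(s.Steps) {
			return // happy path: nothing to undo
		}
		// ctx may already be cancelled (that may be why a step failed), so
		// compensate on a context that ignores cancellation. This also drops
		// the deadline; add a fresh timeout if compensations can hang.
		cctx := context.WithoutCancel(ctx)
		// Compensate in reverse order
		for i := len(completed) - 1; i >= 0; i-- {
			step := s.Steps[completed[i]]
			if err := step.Compensate(cctx, state); err != nil {
				// Record to operator queue; don't propagate (we're in defer)
				slog.Error("compensation failed", "step", step.Name, "err", err)
			}
		}
	}()

	for i, step := range s.Steps {
		if err := step.Forward(ctx, state); err != nil {
			return fmt.Errorf("step %s failed: %w", step.Name, err)
		}
		completed = append(completed, i)
	}
	return nil
}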
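The defer does the compensation. The completed slice tracks which steps succeeded; the step that failed is not in it, so its Forward has to clean up after itself before returning an error. The reverse iteration ensures the right order.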
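To see the ordering in action, here is a minimal runnable sketch of the five-step disbursement with a simulated failure at step 4. The State type with a Trace field, the traced helper, and runDisbursement are assumptions made up for this illustration; the Run method is a condensed copy of the one above that ignores compensation errors to keep the example short.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// State is a hypothetical stand-in for the saga's shared state. Here it only
// records which actions ran, so the ordering is visible.
type State struct {
	Trace []string
}

type Step struct {
	Forward    func(ctx context.Context, state *State) error
	Compensate func(ctx context.Context, state *State) error
	Name       string
}

type Saga struct {
	Steps []Step
}

// Run executes steps in order; on failure the deferred function compensates
// the completed steps in reverse. Compensation errors are ignored in this sketch.
func (s *Saga) Run(ctx context.Context, state *State) error {
	var completed []int
	defer func() {
		if len(completed) == len(s.Steps) {
			return
		}
		for i := len(completed) - 1; i >= 0; i-- {
			_ = s.Steps[completed[i]].Compensate(ctx, state)
		}
	}()
	for i, step := range s.Steps {
		if err := step.Forward(ctx, state); err != nil {
			return fmt.Errorf("step %s failed: %w", step.Name, err)
		}
		completed = append(completed, i)
	}
	return nil
}

// traced builds a step that records its forward and compensating actions.
// If fail is non-nil, Forward returns it without recording anything.
func traced(name string, fail error) Step {
	return Step{
		Name: name,
		Forward: func(ctx context.Context, st *State) error {
			if fail != nil {
				return fail
			}
			st.Trace = append(st.Trace, "do:"+name)
			return nil
		},
		Compensate: func(ctx context.Context, st *State) error {
			st.Trace = append(st.Trace, "undo:"+name)
			return nil
		},
	}
}

// runDisbursement runs the five steps, failing at failAt (1-based; 0 means
// no failure), and returns the recorded trace plus the saga's error.
func runDisbursement(failAt int) (string, error) {
	names := []string{"reserve", "record", "gateway", "balance", "notify"}
	saga := &Saga{}
	for i, n := range names {
		var fail error
		if i+1 == failAt {
			fail = errors.New("simulated failure")
		}
		saga.Steps = append(saga.Steps, traced(n, fail))
	}
	st := &State{}
	err := saga.Run(context.Background(), st)
	return strings.Join(st.Trace, " "), err
}

func main() {
	trace, err := runDisbursement(4)
	fmt.Println(trace) // do:reserve do:record do:gateway undo:gateway undo:record undo:reserve
	fmt.Println(err)   // step balance failed: simulated failure
}
```

The failed step (balance) never appears in the trace, and notify is never attempted, which matches the "compensate 3, 2, 1" rule from earlier.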
Compensation failures
A compensation can fail. The reservation-release fails because the source account was frozen. The refund fails because the gateway is down.
The pattern for handling this:
- Log the compensation failure to a dedicated table. Don’t propagate it (you’re already on the error path).
- Surface to operator review. A daily report; an on-call alert if it’s frequent.
- Idempotent compensations. When the operator retries, the compensation has to be safe to run again.
What I’ve seen go wrong
- Compensations that aren’t idempotent. A refund triggered twice refunds twice. Make compensations record their effect and short-circuit on duplicate calls.
- Compensations that depend on state the forward step modified. If the forward step deleted a row, the compensation has nothing to revert. Soft-delete during forward; hard-delete in a separate cleanup pass.
- Long-running compensations holding locks. A 5-minute compensation that holds a lock blocks other sagas. Compensations should be quick, or emitted asynchronously as their own jobs.
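For the first item in that list, a minimal sketch of a refund compensation that records its effect and short-circuits on a duplicate call. RefundLedger and its methods are hypothetical names made up for this example, and the in-memory map stands in for a table with a unique constraint on the transaction ID.

```go
package main

import (
	"fmt"
	"sync"
)

// RefundLedger is a hypothetical record of refunds already issued, keyed by
// transaction ID.
type RefundLedger struct {
	mu       sync.Mutex
	issued   map[string]bool
	refunded int64 // total cents actually refunded
}

func NewRefundLedger() *RefundLedger {
	return &RefundLedger{issued: make(map[string]bool)}
}

// Refund issues the refund at most once per txID. It reports whether this
// call actually refunded (true) or was a duplicate that did nothing (false).
func (l *RefundLedger) Refund(txID string, amountCents int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.issued[txID] {
		return false // already refunded: short-circuit
	}
	l.refunded += amountCents // stand-in for the real gateway refund call
	l.issued[txID] = true
	return true
}

// TotalCents returns the total amount refunded so far.
func (l *RefundLedger) TotalCents() int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.refunded
}

func main() {
	ledger := NewRefundLedger()
	fmt.Println(ledger.Refund("tx-123", 5000)) // first call refunds
	fmt.Println(ledger.Refund("tx-123", 5000)) // operator retry is a no-op
	fmt.Println(ledger.TotalCents())
}
```

In a real system the gateway call and the ledger write are not atomic, so a crash between them can still produce a duplicate; if your gateway accepts an idempotency key, passing the transaction ID through is the usual way to close that gap.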
For Genie’s pkg/workflow saga implementation, these patterns are baked into the helper. The application defines the steps; the framework handles the reverse-order compensation and the failure-of-compensation queue.
The unhappy path is where the engineering shows. The happy path writes itself; the saga compensations are what you actually need to think about.