Error-code orchestration
Globe ran transactions across multiple telco/FinTech partner integrations. As the partner count grew, every worker’s error handling grew with it — long switches over HTTP status codes, partner-specific edge cases, copy-pasted retry logic. The refactor that fixed it: a typed error-code enum with adapters mapping each partner’s specifics into the shared set.
The before state
// worker.go: the part nobody wanted to maintain
switch resp.StatusCode {
case 200, 201:
	return nil
case 400:
	if strings.Contains(body, "INSUF_FUND") {
		return errInsufficientFunds
	}
	if strings.Contains(body, "INVALID_ACCT") {
		return errInvalidAccount
	}
	return errBadRequest
case 401, 403:
	return errAuth
case 408, 504:
	return errTimeout
case 429:
	// partner X uses 429 for rate limit; partner Y uses 429 for
	// something else; check the body to disambiguate
	if isRateLimit(body) {
		return errRateLimit
	}
	return errOther
case 500, 502, 503:
	return errPartnerOutage
default:
	return errOther
}
This was the short version. The actual switch was 80+ cases across 12 partners. Adding a new partner meant grepping for the existing switch in 7 services and patching each.
Worse: the orchestration logic — what to retry, what to DLQ, what to page on — depended on the error. The orchestration was scattered across the same switch. A new partner with unusual semantics meant changing orchestration in 7 places.
The refactor
Introduce a typed enum of error codes. Push the partner-specific logic into per-partner adapters. The orchestration depends only on the enum.
type ErrorCode int

const (
	ErrTransient         ErrorCode = iota // retry with backoff
	ErrInsufficientFunds                  // notify, no retry
	ErrInvalidAccount                     // DLQ, operator review
	ErrPartnerOutage                      // back off harder
	ErrRateLimit                          // wait per partner's policy
	ErrAuth                               // page; partner credentials broken
	ErrFatal                              // page; bug
	ErrDuplicate                          // treat as success
	// ... about 20 codes total ...
)

type PartnerError struct {
	Code    ErrorCode
	Message string
	Retry   *time.Duration // override default backoff if non-nil
	Raw     string         // for forensics
}
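Typed codes are easier to log and to use as metric labels when they print as names rather than integers. Here is a minimal sketch of that, assuming only the eight codes shown above (the rest of the set is elided) and a hypothetical `String` method and `codeNames` table that the original code may or may not have had:

```go
package main

import "fmt"

// ErrorCode mirrors the enum above; only the eight codes shown in the post are listed.
type ErrorCode int

const (
	ErrTransient ErrorCode = iota
	ErrInsufficientFunds
	ErrInvalidAccount
	ErrPartnerOutage
	ErrRateLimit
	ErrAuth
	ErrFatal
	ErrDuplicate
)

// codeNames is a hypothetical lookup table indexed by ErrorCode.
var codeNames = [...]string{
	ErrTransient:         "transient",
	ErrInsufficientFunds: "insufficient_funds",
	ErrInvalidAccount:    "invalid_account",
	ErrPartnerOutage:     "partner_outage",
	ErrRateLimit:         "rate_limit",
	ErrAuth:              "auth",
	ErrFatal:             "fatal",
	ErrDuplicate:         "duplicate",
}

// String implements fmt.Stringer so codes print by name in logs and metrics.
// Out-of-range values fall back to a numeric form instead of panicking.
func (c ErrorCode) String() string {
	if c < 0 || int(c) >= len(codeNames) {
		return fmt.Sprintf("ErrorCode(%d)", int(c))
	}
	return codeNames[c]
}

func main() {
	fmt.Println(ErrRateLimit)  // prints "rate_limit"
	fmt.Println(ErrorCode(99)) // prints "ErrorCode(99)"
}
```

The snake_case names are an arbitrary choice for illustration; whatever naming the metrics backend prefers would work the same way.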
Each partner has an adapter:
// partners/x/errors.go
func (a *XAdapter) MapError(resp *http.Response, body string) *PartnerError {
	switch resp.StatusCode {
	case 200, 201:
		return nil
	case 400:
		switch {
		case strings.Contains(body, "INSUF_FUND"):
			return &PartnerError{Code: ErrInsufficientFunds, Message: "insufficient funds", Raw: body}
		case strings.Contains(body, "INVALID_ACCT"):
			return &PartnerError{Code: ErrInvalidAccount, Message: "invalid account", Raw: body}
		}
		// An unrecognized 400 means our request was malformed; retrying the
		// same payload will not help, so treat it as a bug rather than transient.
		return &PartnerError{Code: ErrFatal, Message: "bad request", Raw: body}
	case 429:
		// Partner X uses 429 only for rate limiting, so no body check is needed here.
		retry := parseRetryAfter(resp)
		return &PartnerError{Code: ErrRateLimit, Retry: &retry, Raw: body}
	case 500, 502, 503:
		return &PartnerError{Code: ErrPartnerOutage, Raw: body}
	// ... etc ...
	}
	// Anything unmapped is assumed retryable until proven otherwise.
	return &PartnerError{Code: ErrTransient, Raw: body}
}
The orchestration is now:
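// orchestrator.go: short, doesn't change per partner
func dispatch(err *PartnerError) Action {
	if err == nil {
		// Adapters return nil on success; there is nothing to orchestrate.
		return Action{Success: true}
	}
	switch err.Code {
	case ErrTransient, ErrPartnerOutage:
		return Action{Retry: true, Backoff: backoffForCode(err.Code, err.Retry)}
	case ErrRateLimit:
		// err.Retry is nil when the partner sent no hint, so never dereference
		// it directly; backoffForCode applies the override or the default.
		return Action{Retry: true, Backoff: backoffForCode(err.Code, err.Retry)}
	case ErrInsufficientFunds, ErrInvalidAccount:
		return Action{DLQ: true}
	case ErrAuth, ErrFatal:
		return Action{Page: true}
	case ErrDuplicate:
		return Action{Success: true}
	}
	// Codes without an explicit policy go to the DLQ for operator review.
	return Action{DLQ: true}
}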
Adding a new partner: write partners/newpartner/errors.go. The
orchestrator doesn’t change. The retry tables don’t change. The
on-call playbooks don’t change.
What this unlocked
Three concrete wins beyond “the code is cleaner”:
- Per-partner SLA tracking. Now that errors are typed, we can say “partner X has a 2% ErrPartnerOutage rate this week.” Before, the metric was “partner X returned non-200 2% of the time,” which conflated three different problems.
- Test fixtures. A worker test that needed to simulate “partner returned insufficient funds” became &PartnerError{Code: ErrInsufficientFunds}: no fake HTTP response, no body parsing. Test count went up; test maintenance went down.
- New partner onboarding. The first version of a new partner adapter could be a stub that returned ErrTransient for everything. The partner integration team could run end-to-end tests immediately and refine the error mapping as they discovered the partner’s semantics. The previous shape required the full mapping up-front because adding a case meant editing every worker.
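The per-partner rate in the first win is just a counter keyed by partner and code. As a rough sketch, assuming a hypothetical in-memory `Stats` type (a production system would more likely export counters to its metrics library, and this version is not safe for concurrent use):

```go
package main

import "fmt"

type ErrorCode int

const (
	ErrTransient ErrorCode = iota
	ErrPartnerOutage
)

// PartnerError is trimmed to the one field the counter needs.
type PartnerError struct{ Code ErrorCode }

// Stats is a hypothetical counter of calls and typed errors per partner.
type Stats struct {
	calls  map[string]int
	byCode map[string]map[ErrorCode]int
}

func NewStats() *Stats {
	return &Stats{calls: map[string]int{}, byCode: map[string]map[ErrorCode]int{}}
}

// Observe records one partner call; a nil err counts as a success.
func (s *Stats) Observe(partner string, err *PartnerError) {
	s.calls[partner]++
	if err == nil {
		return
	}
	if s.byCode[partner] == nil {
		s.byCode[partner] = map[ErrorCode]int{}
	}
	s.byCode[partner][err.Code]++
}

// Rate returns the fraction of a partner's calls that ended in code.
func (s *Stats) Rate(partner string, code ErrorCode) float64 {
	if s.calls[partner] == 0 {
		return 0
	}
	return float64(s.byCode[partner][code]) / float64(s.calls[partner])
}

func main() {
	s := NewStats()
	for i := 0; i < 98; i++ {
		s.Observe("x", nil)
	}
	s.Observe("x", &PartnerError{Code: ErrPartnerOutage})
	s.Observe("x", &PartnerError{Code: ErrPartnerOutage})
	fmt.Printf("%.1f%%\n", 100*s.Rate("x", ErrPartnerOutage)) // prints "2.0%"
}
```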
The migration
The migration to the new shape was a multi-week effort across seven services. The discipline that made it safe:
- Strangler-fig pattern. Old switch stayed. New adapter shipped alongside it. A feature flag decided which one fed the orchestrator.
- One partner at a time. Migrate partner A’s adapter, flip the flag for partner A only, observe for a week, repeat.
- Side-by-side comparison in test. During the migration, every test ran both code paths and compared. Discrepancies meant the new adapter missed a case.
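The flag and the side-by-side comparison can share one small wrapper. The sketch below is an illustration, not the code we shipped: the `Router` type, the `mapFunc` signature, the in-memory `useAdapter` map standing in for the feature flag, and both example mappers are assumptions. The new adapter deliberately forgets the 429 case so the comparison has something to catch:

```go
package main

import "fmt"

type ErrorCode int

const (
	ErrTransient ErrorCode = iota
	ErrPartnerOutage
	ErrRateLimit
)

// mapFunc is a hypothetical common signature for the legacy switch and
// the new adapter; failed is false for a successful response.
type mapFunc func(status int, body string) (code ErrorCode, failed bool)

// Router picks one code path per partner and can compare both.
type Router struct {
	legacy, adapter mapFunc
	useAdapter      map[string]bool // stand-in for the per-partner feature flag
}

// Map routes through the adapter only for partners whose flag is on.
func (r Router) Map(partner string, status int, body string) (ErrorCode, bool) {
	if r.useAdapter[partner] {
		return r.adapter(status, body)
	}
	return r.legacy(status, body)
}

// Agree runs both paths on the same input; tests call it for every fixture.
func (r Router) Agree(status int, body string) bool {
	lc, lf := r.legacy(status, body)
	ac, af := r.adapter(status, body)
	return lf == af && (!lf || lc == ac)
}

func legacySwitch(status int, body string) (ErrorCode, bool) {
	switch status {
	case 200, 201:
		return 0, false
	case 429:
		return ErrRateLimit, true
	case 500, 502, 503:
		return ErrPartnerOutage, true
	}
	return ErrTransient, true
}

// newAdapter is missing the 429 case on purpose.
func newAdapter(status int, body string) (ErrorCode, bool) {
	switch status {
	case 200, 201:
		return 0, false
	case 500, 502, 503:
		return ErrPartnerOutage, true
	}
	return ErrTransient, true
}

func main() {
	r := Router{legacy: legacySwitch, adapter: newAdapter, useAdapter: map[string]bool{"a": true}}
	fmt.Println(r.Agree(503, ""), r.Agree(429, "")) // prints "true false"
}
```

A `false` from `Agree` is the discrepancy signal described above: the new adapter missed a case the old switch handled.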
Six weeks end to end. About 4,000 lines of code touched. Zero production incidents during the migration.
What I’d watch out for
The pattern has two failure modes if you’re not careful:
- Enum sprawl. Twenty error codes is fine. Eighty is a different switch problem with a different shape. Resist the urge to add codes for nuance; if the orchestration treats two codes the same way, they should be the same code.
- Adapter drift. Two partners with similar errors should map to the same code. If partner A maps INSUF_FUND to ErrInsufficientFunds and partner B maps LOW_BAL to a new ErrLowBalance, the orchestration starts caring about the distinction. Code review for adapter PRs should catch this.
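Part of that review can be mechanical. One possible aid, sketched here under the assumption that the orchestration policy can be written down as a table, is to list the codes that share a policy, since those are the candidates for merging. The `policy` map, its string labels, and `sharedPolicies` are hypothetical, and `ErrLowBalance` is the drifted code from the example above:

```go
package main

import (
	"fmt"
	"sort"
)

type ErrorCode int

const (
	ErrInsufficientFunds ErrorCode = iota
	ErrInvalidAccount
	ErrAuth
	ErrLowBalance // the drifted code from the example above
)

// policy is a hypothetical table of how the orchestrator treats each code.
var policy = map[ErrorCode]string{
	ErrInsufficientFunds: "notify",
	ErrInvalidAccount:    "dlq",
	ErrAuth:              "page",
	ErrLowBalance:        "notify",
}

// sharedPolicies groups codes by policy and keeps only the groups with
// more than one code, each sorted by code value.
func sharedPolicies(p map[ErrorCode]string) map[string][]ErrorCode {
	groups := map[string][]ErrorCode{}
	for code, action := range p {
		groups[action] = append(groups[action], code)
	}
	for action, codes := range groups {
		if len(codes) < 2 {
			delete(groups, action)
			continue
		}
		sort.Slice(codes, func(i, j int) bool { return codes[i] < codes[j] })
	}
	return groups
}

func main() {
	fmt.Println(sharedPolicies(policy)) // prints "map[notify:[0 3]]"
}
```

A shared policy is a prompt for a conversation, not proof of duplication. Two codes can share a policy and still be worth separating for metrics, so a human should make the final call.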
What transfers
Any system with multiple downstream providers benefits from the same shape:
- LLM providers (OpenAI vs Anthropic vs Bedrock return different errors; route them through ErrRateLimit, ErrModelOverloaded, ErrContextTooLong, etc.).
- Payment gateways (Stripe vs Razorpay vs PayU each have their own shape; the orchestration shouldn’t care).
- Cloud APIs (AWS vs GCP vs Azure errors; the multi-cloud client benefits enormously from the pattern).
The first time you write the enum, it feels like over-engineering. The first time you add a new provider in 30 minutes, it pays for itself.