Test coverage and observability at Picnic
Coverage and observability are the unglamorous parts of an engineering org. They don’t get the headlines that performance wins do, but they’re the substrate that makes those wins safe. The Picnic team’s 80%+ coverage and Prometheus-everywhere approach are what turned a days-to-detect on-call into a minutes-to-detect one.
Where the team was before
The Picnic backend was Go microservices serving 1M+ users. When I joined the team:
- Coverage was around 35%.
- Prometheus was deployed but few services exported meaningful metrics.
- An on-call rota existed, but most pages started with a user report, not an automated alert.
- Detection time on real incidents was usually measured in days.
That last bullet was the loudest one. A user reports a problem on Monday; the team finds the root cause on Wednesday; the fix ships on Friday. The latency wasn’t acceptable, and the team knew it.
The coverage push
We didn’t push coverage as a number. The number is gameable — write tests that cover lines without checking behaviour and you hit any target. The push was on three specific shapes:
- Integration tests for every gRPC handler. Each handler had a test that spun up a real Postgres (via testcontainers), wired the handler against it, and exercised every documented behaviour.
- Contract tests for every protobuf message. A round-trip test that serialised and deserialised every documented field, with edge values (empty strings, nil pointers, max-length strings, unicode that breaks naive validators).
- Failure injection tests. For every external dependency (database, downstream service), a test that simulated the dependency being slow, broken, or returning garbage. The handler had to respond gracefully.
The coverage number rose to ~80% as a side effect of those three shapes being thorough. The shapes mattered more than the number.
The Prometheus push
The team had a Prometheus deployment but uneven usage. We standardised on a small library that every service used:
metrics := mid.NewMetrics("user_profile")
http.Handle("/metrics", promhttp.Handler())
The library exposed:
- <service>_http_requests_total{method, path, status}: counter
- <service>_http_request_duration_seconds{method, path}: histogram
- <service>_grpc_requests_total{method, status}: same shape, gRPC
- <service>_grpc_request_duration_seconds{method}: histogram
- <service>_dependency_calls_total{dep, status}: every external dep
- <service>_dependency_duration_seconds{dep}: same
That was it. Every service exported the same six families. Every service got the same Grafana dashboard rendered from a JSON template.
The dashboards
Three dashboards per service, generated from templates:
- Service overview: request rate, error rate, latency percentiles, dependency health. The dashboard the on-call opens first.
- Per-endpoint detail: broken down by URL path / gRPC method. Used when the overview shows a problem and you need to localise it.
- Dependency detail: every external call’s latency and error rate. Used when the overview shows the service is slow and you suspect a downstream.
The templates were the magic. A newly deployed service inherited all three dashboards on day one. The on-call experience for a new service was identical to the experience for a service that had been around for years.
Alerts that match incidents
The team’s previous alerting was noisy — alerts on raw CPU percentages, on individual error counts, on disk usage. Most alerts during my time on call were ignorable.
We replaced the alert set with SLO-derived alerts:
- For each service, define an SLO (e.g. 99.9% of requests succeed, 99% of requests complete in <500ms).
- Alert when the error budget burn rate is high enough to exhaust the budget before the next pager handoff.
The math is from Google’s SRE workbook. The implementation in Prometheus is straightforward (a recording rule for the burn rate, an alert when it exceeds a threshold for a window).
After the change, alert volume dropped by ~70%. Each remaining alert mapped to a real incident the team needed to act on.
The on-call experience after
The new flow:
- Alert fires (burn rate elevated on user-profile service).
- On-call opens the service overview dashboard.
- Dashboard shows latency spike on /profile/get.
- On-call clicks into the per-endpoint detail.
- Detail shows the dependency call to the user-accounts service is slow.
- On-call opens user-accounts service overview.
- Sees user-accounts has elevated DB query times.
- Digs into Spanner monitoring; finds a slow query.
The whole path is 5-10 minutes for a typical incident. Before, the same incident took the better part of a day because the operator was reconstructing the timeline from grepped logs.
What didn’t work
Two things we tried and abandoned:
- Tracing as the primary observability tool. OpenTelemetry traces are valuable for the deep cases but expensive to instrument and noisy at scale. We added tracing later (after the metrics work) and used it as a debugging tool, not as the front-line view.
- Logs as a metric source. We tried extracting metrics from structured logs via promtail. The cardinality blew up almost immediately; some logged fields had thousands of distinct values. We reverted to explicit Prometheus instrumentation and used logs only for forensic detail.
What you should take
If you’re on a team where on-call is reactive:
- Pick the six metric families. Make them identical across services. Build the template dashboards.
- Define SLOs. Replace your noisy alerts with burn-rate alerts.
- Push integration test coverage on every handler. The number doesn’t matter; the shape of the tests does.
The three together change the on-call rhythm in a quarter. The team that operates a system this way ships braver changes because the safety net is dense.
The Picnic team’s 47% latency win — the headline result — was only safe to attempt because the coverage and observability work had landed first. Without coverage, the protobuf migration would have broken something invisibly. Without dashboards, the consolidation would have shifted the bottleneck somewhere we couldn’t see. The boring infrastructure made the interesting work possible.