OpenTelemetry Kubernetes Open Source Observability

Integrating OpenTelemetry into airshipit

airshipit is an open-source project for declarative bare-metal Kubernetes lifecycle management — used by telecom operators for things like 5G core deployment. Contributions come from Ericsson, AT&T, Microsoft, and others. I worked on the OpenTelemetry integration. The integration itself took weeks; making the traces usable across multi-vendor code took months.

The starting point

airshipit was a constellation of components: a CLI, a couple of controllers, the underlying provisioning workflow. Each emitted structured logs but no traces. When a deployment failed three hours in, the operator had to grep across components and reconstruct the timeline manually.

The brief: integrate OpenTelemetry so a single trace covers the end-to-end provisioning, surface that trace in the operator’s existing observability stack (varied by vendor — Grafana Tempo for some, Honeycomb for others, Lightstep for yet others), and reduce the manual reconstruction time.

What OTel gave us

The straightforward part: instrument each component with the OTel SDK, emit spans, propagate trace context across HTTP and Kubernetes events.

As a result, the operator could pull one trace ID from the CLI output and see every operation from every component on a single trace. The “what happened in those three hours” question moved from a grep exercise to browsing a trace in Jaeger or Tempo.

This part was straightforward because OTel is well designed; the work was mostly mechanical.

What OTel didn’t give us — context across foreign code

Multi-vendor OSS means components from different teams with different conventions. Ericsson’s component might emit spans named one way; AT&T’s another; Microsoft’s a third. Without coordination the traces were technically correct and operationally useless — nobody could read “namespace.k8s.apply.do” and “azureNetwork.x” and “ericsson_provision.attempt” together as one story.

We added two cross-cutting things:

  1. A shared span naming convention. A short document agreed across the vendors: spans named verb.subject (e.g. apply.cluster, wait.pod, provision.host), top-level attributes pulled from a fixed list, no PII in attributes.
  2. A semantic glossary. Each vendor’s contributors mapped their internal terms to the convention. “Host” meant the same thing everywhere; “node” meant something different; the glossary resolved it.
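A convention like this is easy to check mechanically, for example in CI or a PR check. The sketch below is a minimal, assumed version of such a lint: the function name check_span and the contents of ALLOWED_ATTRIBUTES are hypothetical (the agreed attribute list is not reproduced in this post), and it covers only the name shape and the attribute allow-list, not the glossary or PII rules.

```python
import re

# Hypothetical allow-list: the real "fixed list" of top-level attributes
# is not published in the post, so these keys are placeholders.
ALLOWED_ATTRIBUTES = {"cluster.name", "host.id", "component"}

# verb.subject: two lowercase tokens joined by exactly one dot.
SPAN_NAME = re.compile(r"[a-z]+\.[a-z][a-z0-9_]*")


def check_span(name, attributes):
    """Return a list of convention violations; an empty list means the span conforms."""
    problems = []
    if SPAN_NAME.fullmatch(name) is None:
        problems.append(f"name {name!r} is not of the form verb.subject")
    for key in sorted(attributes):
        if key not in ALLOWED_ATTRIBUTES:
            problems.append(f"attribute {key!r} is not in the fixed list")
    return problems


if __name__ == "__main__":
    for span_name in ("apply.cluster", "namespace.k8s.apply.do", "azureNetwork.x"):
        print(span_name, check_span(span_name, {}))
```

Returning a list of problems rather than raising lets a reviewer see every violation in one pass, which suits a PR template better than failing on the first one.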

The convention took two video calls and one PR template; the glossary took longer because internal vocabularies are surprisingly load-bearing.

The propagation surprise

Trace context propagates over HTTP via the W3C traceparent header. That covers most distributed systems.
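For reference, a version-00 traceparent value has four dash-separated lowercase hex fields: version, a 32-character trace ID, a 16-character parent span ID, and 2 characters of flags. The sketch below parses and formats that shape with the standard library only. It is an illustration rather than the OTel SDK's propagator, and the names SpanContext, parse_traceparent, format_traceparent and child_of are made up for this example.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id - parent-id - trace-flags, all lowercase hex.
TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str  # 32 hex chars, shared by every span in the trace
    span_id: str   # 16 hex chars, unique per span
    flags: str     # 2 hex chars; "01" means sampled


def format_traceparent(ctx: SpanContext) -> str:
    return f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.flags}"


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse a version-00 header; return None for anything malformed."""
    match = TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # This sketch only handles version 00; all-zero IDs are invalid per the spec.
    if version != "00" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, flags)


def child_of(parent: SpanContext) -> SpanContext:
    """New span in the same trace: keep the trace ID, mint a fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.flags)
```

The receiving service parses the header, starts its own span with child_of, and sends format_traceparent of that child on any outgoing request, which is what keeps one trace ID across every hop.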

airshipit had Kubernetes events as a propagation boundary — one controller fires an event, another reacts. The trace context doesn’t ride along on a Kubernetes event by default.

We solved it with a custom annotation. The producing controller wrote airshipit.io/trace-context: 00-...-... on the related resource; the consuming controller read it and used it as the parent of its spans. The result: traces that crossed event boundaries cleanly.
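The annotation hand-off can be sketched with plain dictionaries standing in for Kubernetes resources. This is an assumed reconstruction, not the project's code: the annotation key is the one quoted above, the value is assumed to follow the traceparent layout, and stamp_trace_context and start_child_span are hypothetical helper names. A real controller would do this through its Kubernetes client and the OTel SDK rather than raw dicts.

```python
import secrets

TRACE_ANNOTATION = "airshipit.io/trace-context"  # key quoted in the post


def stamp_trace_context(resource, trace_id, span_id, sampled=True):
    """Producer side: record the current span in the resource's annotations."""
    annotations = resource.setdefault("metadata", {}).setdefault("annotations", {})
    flags = "01" if sampled else "00"
    # Assumption: the value uses the same layout as a traceparent header.
    annotations[TRACE_ANNOTATION] = f"00-{trace_id}-{span_id}-{flags}"
    return resource


def start_child_span(resource, name):
    """Consumer side: start a span parented on the annotated span, if present."""
    annotations = resource.get("metadata", {}).get("annotations", {})
    value = annotations.get(TRACE_ANNOTATION)
    parts = value.split("-") if value else []
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        trace_id, parent_id = parts[1], parts[2]
    else:
        # No usable context: start a fresh trace instead of failing the reconcile.
        trace_id, parent_id = secrets.token_hex(16), None
    return {
        "name": name,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "span_id": secrets.token_hex(8),
    }
```

Falling back to a fresh trace when the annotation is missing or malformed means a resource created before the change, or by a tool that does not stamp it, still gets traced, just without the link to its producer.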

This required two PRs across the affected controllers and a small update to the OTel propagator. The pattern transferred to other event-driven Kubernetes projects.

The 30% number

The operator team measured “time from deployment failure to root cause identified” before and after the OTel integration. The median dropped from ~45 minutes to ~14 minutes, which is to roughly 30% of the previous per-incident time (about a 69% cut), multiplied across the incident volume.
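As a quick check on those two medians, the arithmetic can be worked through directly. The only inputs are the approximate figures quoted above; the variable names are just for illustration.

```python
before_min = 45  # approximate median before the integration, from the post
after_min = 14   # approximate median after the integration, from the post

remaining = after_min / before_min  # fraction of the original time still spent
reduction = 1 - remaining           # fraction of the original time saved

print(f"remaining: {remaining:.0%}, reduction: {reduction:.0%}")
# prints "remaining: 31%, reduction: 69%"
```

So ~14 minutes is about 31% of ~45 minutes: the ~30% figure lines up with the time remaining per incident, and the time saved is closer to 69%.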

The bulk of the win wasn’t OTel per se. It was the naming work around it: without the shared convention and glossary, the traces would have been technically correct and operationally unloved.

What I’d do differently

Two things:

  1. Start the naming convention before the instrumentation. I did them in the wrong order. We had instrumentation across all components by the time the naming convention was ratified, and we had to go back and refactor. Six PRs that could have been none.

  2. Build the operator dashboards before declaring success. The integration was technically complete when traces flowed; it was operationally complete when the on-call had dashboards they actually checked. The dashboards lagged the integration by months; the perceived value of the work tracked the dashboard adoption, not the instrumentation date.

The contribution shape

Multi-vendor OSS contributions need a particular discipline.

The contribution rhythm was slower than internal work but the artefact had more reviewers, was used by more operators, and generally became more robust over time. Open source pays back, but the patience is real.
