What gets measured gets managed. What doesn’t get measured gets ignored.
At 30K TPS (Globe), 1M+ users (Picnic), or millions of customers (Bancnet), you can’t manually debug. You need observability: structured logging, distributed tracing, and metrics that automatically alert you to problems.
Cross-reference: High-throughput systems | Latency optimization | Cloud Spanner | Kubernetes deployment | GDPR audit trails
The difference between finding a bug in minutes vs. hours vs. days is observability.
Metrics describe system health: “How many requests per second? What’s the P99 latency?”
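from prometheus_client import Counter, Histogram, Gauge

# Counter: total transactions, labelled by outcome and transaction type
transactions_total = Counter(
    'transactions_total',
    'Total transactions processed',
    ['status', 'type']
)

# Histogram: latency distribution (bucket upper bounds in seconds)
transaction_latency = Histogram(
    'transaction_latency_seconds',
    'Transaction processing latency',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

# Gauge: current active connections
active_connections = Gauge(
    'active_connections',
    'Currently active database connections'
)

# Usage: time the call, then count it by outcome
# (process_transaction is the function defined in the tracing example)
try:
    with transaction_latency.time():
        process_transaction("TXN-123")
except Exception:
    # Record failures too, so an error-rate alert has something to measure
    transactions_total.labels(status='error', type='payment').inc()
    raise
else:
    transactions_total.labels(status='success', type='payment').inc()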
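The bucket boundaries chosen for the histogram limit how precisely a P99 can be reported, because the quantile is estimated from bucket counts rather than from raw samples. The sketch below shows the idea with linear interpolation inside the matching bucket, which is the approach Prometheus's `histogram_quantile` takes. The function name `estimate_quantile` and the sample counts are hypothetical, and real edge cases (negative bounds, missing buckets) are ignored.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for the buckets used in this article
observed = [
    (0.01, 100), (0.05, 600), (0.1, 900), (0.5, 980),
    (1.0, 995), (5.0, 1000), (math.inf, 1000),
]
p99 = estimate_quantile(0.99, observed)  # roughly 0.83 s with these counts
```

With these counts the P99 falls in the 0.5 s to 1.0 s bucket, so the estimate can only be as fine-grained as that bucket is wide. If sub-second P99 accuracy matters, adding boundaries between 0.5 and 1.0 is one option.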
Logs record what happened: “Request X from customer Y failed because Z.”
import structlog

# Structured logging: an event name plus key-value fields (not just "Request failed")
log = structlog.get_logger()
log.info(
    "transaction_completed",
    transaction_id="TXN-123",
    customer_id="CUST-456",
    amount=1000.50,
    status="success",
    latency_ms=125,
)
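The payoff of key-value fields is that logs can be queried rather than grepped. As a minimal sketch, assuming the events are emitted as one JSON object per line, answering "what failed for customer Y?" becomes a field filter. The helper `find_events`, the `transaction_failed` event, and the `reason` field are hypothetical names used only for illustration.

```python
import json

# Hypothetical log lines, one JSON object per line
LOG_LINES = [
    '{"event": "transaction_completed", "transaction_id": "TXN-123", '
    '"customer_id": "CUST-456", "status": "success", "latency_ms": 125}',
    '{"event": "transaction_failed", "transaction_id": "TXN-124", '
    '"customer_id": "CUST-456", "status": "error", "reason": "insufficient_funds"}',
    '{"event": "transaction_completed", "transaction_id": "TXN-125", '
    '"customer_id": "CUST-789", "status": "success", "latency_ms": 90}',
]


def find_events(lines, **criteria):
    """Return parsed events whose fields equal every key/value in criteria."""
    events = (json.loads(line) for line in lines)
    return [
        event for event in events
        if all(event.get(key) == value for key, value in criteria.items())
    ]


# All failed transactions for one customer
failures = find_events(LOG_LINES, customer_id="CUST-456", status="error")
for event in failures:
    print(event["transaction_id"], event["reason"])
```

In production the same filter would normally be expressed in a log aggregation tool's query language. The point here is only that consistent field names make such a query possible.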
Traces show how a request flowed: “Request entered Service A → called Service B → queried Database → returned in 200ms.”
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_transaction(transaction_id):
    with tracer.start_as_current_span("process_transaction") as span:
        span.set_attribute("transaction_id", transaction_id)
        # Nested span: validate
        with tracer.start_as_current_span("validate") as validate_span:
            validate_span.set_attribute("status", "ok")
        # Nested span: execute
        with tracer.start_as_current_span("execute") as exec_span:
            exec_span.set_attribute("status", "ok")
        # Nested span: audit
        with tracer.start_as_current_span("audit_log"):
            pass
        return "success"

# Trace flow: process_transaction → validate (50ms) → execute (100ms) → audit_log (10ms)
# Prometheus alert rules
groups:
  - name: application
    rules:
      # Alert if P99 latency exceeds 500 ms
      # (histogram_quantile needs the per-bucket rates, aggregated by le)
      - alert: HighLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(transaction_latency_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: page
      # Alert if error rate > 1% (sum across labels before dividing)
      - alert: HighErrorRate
        expr: sum(rate(transactions_total{status="error"}[5m])) / sum(rate(transactions_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
      # Alert if database connections exceed 80% of the limit
      # (assumes max_connections is also exported as a gauge)
      - alert: HighDatabaseConnections
        expr: active_connections / max_connections > 0.8
        for: 5m
        labels:
          severity: warn
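The error-rate rule compares the per-second rate of failed transactions with the rate of all transactions over a five-minute window. A minimal sketch of that arithmetic, summing across every label combination before dividing, is shown below. The helper names and counter readings are hypothetical, and counter resets, which Prometheus's `rate` handles, are ignored.

```python
def per_second_rate(first, last, window_seconds):
    """Approximate a counter's per-second increase over a window."""
    return (last - first) / window_seconds


def error_ratio(samples, window_seconds=300):
    """Share of transactions that failed during the window.

    `samples` maps (status, type) to (counter at window start, counter at end).
    """
    total = sum(
        per_second_rate(first, last, window_seconds)
        for first, last in samples.values()
    )
    errors = sum(
        per_second_rate(first, last, window_seconds)
        for (status, _type), (first, last) in samples.items()
        if status == "error"
    )
    return errors / total if total else 0.0


# Hypothetical counter readings five minutes apart
samples = {
    ("success", "payment"): (10_000, 12_910),
    ("success", "refund"): (500, 530),
    ("error", "payment"): (40, 100),
}
ratio = error_ratio(samples)
should_alert = ratio > 0.01  # same 1% threshold as the rule
```

With these readings 60 of 3,000 transactions failed, so the ratio is about 2% and the condition holds. In Prometheus the alert would still only fire once the condition has stayed true for the full `for: 5m` period.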
Tags: #Observability #Monitoring #Tracing #Prometheus #OpenTelemetry
Published: June 2026
Author: Pratik Dhanave
Related Projects: Globe, Picnic, Bodh