genix

Observability at Scale: From Manual Debugging to Automated Detection

What gets measured gets managed. What doesn’t get measured gets ignored.

At 30K TPS (Globe), 1M+ users (Picnic), or millions of customers (Bancnet), you can’t manually debug. You need observability: structured logging, distributed tracing, and metrics that automatically alert you to problems.

Cross-reference: High-throughput systems

Latency optimization

Cloud Spanner

Kubernetes deployment

GDPR audit trails

The difference between finding a bug in minutes vs. hours vs. days is observability.

The Three Pillars

1. Metrics (Numbers)

System health: “How many requests per second? What’s the P99 latency?”

from prometheus_client import Counter, Histogram, Gauge

# Counter: total transactions
transactions_total = Counter(
    'transactions_total',
    'Total transactions processed',
    ['status', 'type']
)

# Histogram: latency distribution
transaction_latency = Histogram(
    'transaction_latency_seconds',
    'Transaction processing latency',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

# Gauge: current active connections
active_connections = Gauge(
    'active_connections',
    'Currently active database connections'
)

# Usage
with transaction_latency.time():
    process_transaction()
    transactions_total.labels(status='success', type='payment').inc()

2. Logs (Events)

What happened: “Request X from customer Y failed because Z.”

import structlog

# Structured logging (not "Request failed")
log = structlog.get_logger()
log.info(
    "transaction_completed",
    transaction_id="TXN-123",
    customer_id="CUST-456",
    amount=1000.50,
    status="success",
    latency_ms=125,
)

3. Traces (Execution Paths)

How did the request flow: “Request entered Service A → called Service B → queried Database → returned in 200ms.”

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_transaction(transaction_id):
    with tracer.start_as_current_span("process_transaction") as span:
        span.set_attribute("transaction_id", transaction_id)
        
        # Nested span: validate
        with tracer.start_as_current_span("validate") as validate_span:
            validate_span.set_attribute("status", "ok")
        
        # Nested span: execute
        with tracer.start_as_current_span("execute") as exec_span:
            exec_span.set_attribute("status", "ok")
        
        # Nested span: audit
        with tracer.start_as_current_span("audit_log"):
            pass
        
        return "success"

# Trace flow: process_transaction → validate (50ms) → execute (100ms) → audit_log (10ms)

Alerting: Detect Before Users Do

# Prometheus alert rules
groups:
  - name: application
    rules:
    # Alert if P99 latency exceeds threshold
    - alert: HighLatency
      expr: histogram_quantile(0.99, transaction_latency_seconds) > 0.5
      for: 5m
      action: page

    # Alert if error rate > 1%
    - alert: HighErrorRate
      expr: rate(transactions_total{status="error"}[5m]) / rate(transactions_total[5m]) > 0.01
      for: 5m
      action: page
    
    # Alert if database connections approaching limit
    - alert: HighDatabaseConnections
      expr: active_connections / max_connections > 0.8
      for: 5m
      action: warn

Production Checklist

Metrics collected at service, database, and infrastructure levels
Logs structured (JSON, not strings)
Traces sampled (10% of requests in prod, 100% in staging)
Alerts for P99 latency, error rate, capacity
Dashboards for real-time system health
On-call rotation with runbooks for each alert

Tags: #Observability #Monitoring #Tracing #Prometheus #OpenTelemetry

Published: June 2026
Author: Pratik Dhanave
Related Projects: Globe, Picnic, Bodh