Voice AI ElevenLabs Multi-Language Bhashini

The PoC

A Kinetic India two-wheeler rider taps a button on the handlebar and speaks. The voice assistant handles vehicle diagnostics (“what’s my mileage?”), service booking (“book service for next week”), and telemetry queries (“what was my top speed yesterday?”).

The constraints:

The pipeline

Mic input (16kHz mono, push-to-talk)
    │
    ▼
ASR (Whisper / Bhashini)         ← language detect + transcribe
    │
    ▼
Intent classification (small LLM)   ← "diagnostics" / "service" / "telemetry"
    │
    ▼
Tool dispatch (Go)                  ← fetch from vehicle API
    │
    ▼
Response generation (small LLM)     ← turn JSON into natural language in the right language
    │
    ▼
TTS (ElevenLabs)                    ← Hindi / Marathi / English voice
    │
    ▼
Audio playback through handlebar speaker

Five stages. Each stage has a latency budget. Total: 2.5s p95.

The latency budget

Stage            Budget
ASR              600ms
Intent           200ms
Tool dispatch    400ms
Response gen     800ms
TTS              500ms
Total            2.5s
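
A minimal sketch of how the per-stage budgets could be checked in code. The stage keys and the helpers `timed` and `over_budget` are hypothetical names chosen for illustration, not part of the PoC. Only the millisecond values come from the table above. Note that the 2.5s target is a p95 figure, so a per-request check like this is a rough guard and not a substitute for percentile tracking.

```python
import time
from typing import Callable, Dict, Tuple

# Per-stage budgets in milliseconds, copied from the table above.
# The dictionary keys are illustrative names, not identifiers from the PoC.
BUDGET_MS: Dict[str, int] = {
    "asr": 600,
    "intent": 200,
    "tool_dispatch": 400,
    "response_gen": 800,
    "tts": 500,
}
TOTAL_BUDGET_MS = 2500


def timed(stage: str, fn: Callable, *args) -> Tuple[object, float, bool]:
    """Run one stage and return (result, elapsed_ms, within_budget)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms, elapsed_ms <= BUDGET_MS[stage]


def over_budget(timings_ms: Dict[str, float]) -> Dict[str, float]:
    """Return the stages that exceeded their budget, with the overshoot in ms."""
    return {
        stage: elapsed - BUDGET_MS[stage]
        for stage, elapsed in timings_ms.items()
        if elapsed > BUDGET_MS[stage]
    }
```

Logging the output of `over_budget` per request makes it easy to see which stage is eating the slack when the end-to-end number drifts.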

To hit this:

Multi-language patterns

Detect language in ASR. Whisper’s multilingual model returns the detected language as part of the output. Use that to route TTS to the matching voice.

Generate in the detected language directly. Don’t translate. The small LLM is prompted with examples in all three languages; it generates in whatever the user spoke.

Voice cloning per language is overkill. A separate voice ID per language (Hindi female, Marathi female, English female) is enough for the rider’s experience. Cloning the same voice across languages is a higher-fidelity goal and not necessary for the PoC.
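
The routing described above can be sketched as a lookup from the detected language code to a voice ID. The voice IDs below are placeholders, and falling back to English for an unexpected language is an assumption made for this sketch, not something the PoC is stated to do. The codes `hi`, `mr`, and `en` are the ISO 639-1 codes Whisper reports for Hindi, Marathi, and English.

```python
from typing import Dict

# Placeholder voice IDs; real ElevenLabs voice IDs would replace these.
VOICE_BY_LANGUAGE: Dict[str, str] = {
    "hi": "voice-hindi-female",
    "mr": "voice-marathi-female",
    "en": "voice-english-female",
}

# Assumption for this sketch: unknown languages fall back to English.
DEFAULT_LANGUAGE = "en"


def pick_voice(detected_language: str) -> str:
    """Map the language code reported by ASR to a TTS voice ID."""
    code = detected_language.strip().lower()
    return VOICE_BY_LANGUAGE.get(code, VOICE_BY_LANGUAGE[DEFAULT_LANGUAGE])
```

The same detected code can be passed into the response-generation prompt, so the LLM and the TTS voice agree on the language.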

ElevenLabs in production

ElevenLabs gives high-quality TTS with streaming. Three patterns that mattered:

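  1. Stream first chunk fast. The first audio bytes should arrive within 200ms of the TTS request. Configure optimize_streaming_latency=3 (maximum latency optimizations with text normalization still on; 4 goes further by turning the text normalizer off, which can cause numbers and dates to be mispronounced).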
  2. Cache common responses. “Your fuel is at 65%” is generated thousands of times; cache the audio bytes keyed by (text, voice ID).
  3. Cost per minute matters. Premium voices cost more. Pick the cheapest voice that meets the quality bar; switch only if user feedback demands it.
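
The caching pattern in item 2 can be sketched as a small wrapper keyed by (text, voice ID). `AudioCache` and the `synthesize` callable are hypothetical names for illustration. In a real deployment the callable would wrap the TTS request, and the in-memory dictionary would likely be replaced by a bounded or persistent store.

```python
import hashlib
from typing import Callable, Dict


class AudioCache:
    """Cache synthesized audio bytes keyed by (text, voice ID)."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        # `synthesize(text, voice_id)` is a stand-in for the real TTS call.
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str, voice_id: str) -> str:
        # Hash voice ID and text together; the NUL separator avoids
        # collisions between different (voice_id, text) splits.
        raw = f"{voice_id}\x00{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, text: str, voice_id: str) -> bytes:
        key = self._key(text, voice_id)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: pay for synthesis once, then reuse the bytes.
        self.misses += 1
        audio = self._synthesize(text, voice_id)
        self._store[key] = audio
        return audio
```

Caching only helps when the response text is byte-identical, so templated responses such as “Your fuel is at 65%” benefit more than free-form LLM output.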

What broke

What I’d carry forward

For voice AI in operational environments (vehicles, factory floors, kitchens):

The Kinetic PoC ran for three months; the patterns above transferred cleanly to a separate fleet-operator voice deployment. The voice stack patterns are durable; the per-domain tools differ.
