The PoC
A Kinetic India two-wheeler rider taps a button on the handlebar and speaks. The voice assistant handles vehicle diagnostics (“what’s my mileage?”), service booking (“book service for next week”), and telemetry queries (“what was my top speed yesterday?”).
The constraints:
- Multi-language: Hindi, Marathi, English at minimum.
- Latency: under 3 seconds end-to-end (rider can’t wait while in motion).
- Robust to noise: traffic noise, wind, helmet muffling.
- Cost: cents per interaction, not dollars.
The pipeline
Mic input (16kHz mono, push-to-talk)
│
▼
ASR (Whisper / Bhashini) ← language detect + transcribe
│
▼
Intent classification (small LLM) ← "diagnostics" / "service" / "telemetry"
│
▼
Tool dispatch (Go) ← fetch from vehicle API
│
▼
Response generation (small LLM) ← turn JSON into natural language in the right language
│
▼
TTS (ElevenLabs) ← Hindi / Marathi / English voice
│
▼
Audio playback through handlebar speaker
Five stages. Each stage has a latency budget. Total: 2.5s p95.
The latency budget
| Stage | Budget |
|---|---|
| ASR | 600ms |
| Intent | 200ms |
| Tool dispatch | 400ms |
| Response gen | 800ms |
| TTS | 500ms |
| Total | 2.5s |
To hit this:
- ASR runs on-device for short utterances (under 2 seconds) and falls back to Bhashini in the cloud for longer ones.
- Intent is a fine-tuned tiny model that fits in 50 MB on the device.
- Tool dispatch caches per-vehicle data; the first call hits the cloud, and subsequent calls for the same datum hit the device cache.
- Response gen uses a small cloud LLM with response streaming, so TTS starts before generation completes.
- TTS streams audio chunks; playback starts within ~200ms of TTS request.
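Two of these tactics are easy to sketch: routing ASR by utterance length, and cutting the LLM's token stream into sentences so TTS can start early. The function names and the sentence-boundary rule below are assumptions for illustration; the 16 kHz sample rate and the 2-second cutoff come from the text.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const sampleRate = 16000 // Hz, mono, per the mic input spec

// onDeviceLimit: utterances shorter than this stay on-device.
const onDeviceLimit = 2 * time.Second

// utteranceDuration converts a mono sample count to a duration.
func utteranceDuration(numSamples int) time.Duration {
	return time.Duration(numSamples) * time.Second / sampleRate
}

// pickASR returns "on-device" or "cloud" (labels are illustrative).
func pickASR(numSamples int) string {
	if utteranceDuration(numSamples) < onDeviceLimit {
		return "on-device"
	}
	return "cloud"
}

// sentenceEnd reports whether a token closes a sentence. "।" is the danda
// used in Hindi and Marathi. This naive rule also splits on abbreviations.
func sentenceEnd(tok string) bool {
	t := strings.TrimSpace(tok)
	for _, end := range []string{".", "?", "!", "।"} {
		if strings.HasSuffix(t, end) {
			return true
		}
	}
	return false
}

// streamSentences forwards each complete sentence to out as soon as it is
// available, then flushes any trailing partial sentence and closes out.
func streamSentences(tokens <-chan string, out chan<- string) {
	defer close(out)
	var buf strings.Builder
	for tok := range tokens {
		buf.WriteString(tok)
		if sentenceEnd(tok) {
			out <- strings.TrimSpace(buf.String())
			buf.Reset()
		}
	}
	if rest := strings.TrimSpace(buf.String()); rest != "" {
		out <- rest
	}
}

func main() {
	fmt.Println(pickASR(24000), pickASR(48000)) // 1.5s and 3s of audio

	tokens := make(chan string)
	out := make(chan string)
	go func() {
		defer close(tokens)
		// Sample token stream standing in for LLM output.
		for _, t := range []string{"Your", " fuel", " is", " at", " 65%.", " Service", " is", " due", " soon."} {
			tokens <- t
		}
	}()
	go streamSentences(tokens, out)
	for s := range out {
		fmt.Println("to TTS:", s)
	}
}
```

Sentence-sized chunks are a compromise: smaller chunks start playback sooner but tend to hurt prosody, since the TTS engine sees less context.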
Multi-language patterns
Detect language in ASR. Whisper’s multilingual model returns the detected language as part of the output. Use that to route TTS to the matching voice.
Generate in the detected language directly. Don’t translate. The small LLM is prompted with examples in all three languages; it generates in whatever the user spoke.
Voice cloning per language is overkill. Different voice IDs per language (Hindi female, Marathi female, English female) are enough for the rider’s experience. Cloning the same voice across languages is a higher-fidelity goal, but it is not necessary for the PoC.
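Detect-and-route reduces to a small lookup. The sketch below assumes the ASR reports ISO 639-1 codes ("hi", "mr", "en"); the voice IDs are placeholders, and falling back to English for an unsupported language is an assumption, not something the PoC specifies.

```go
package main

import "fmt"

// voiceByLang maps a detected language code to a TTS voice ID.
// Codes assume ISO 639-1 from the ASR; voice IDs are placeholders.
var voiceByLang = map[string]string{
	"hi": "voice-hindi-female",
	"mr": "voice-marathi-female",
	"en": "voice-english-female",
}

// fallbackLang is used when the detected language has no voice (assumption).
const fallbackLang = "en"

// routeVoice returns the language to generate in and the voice to speak
// with. The same language code is passed to the LLM prompt and the TTS
// call, so no translation step sits between them.
func routeVoice(detected string) (lang, voiceID string) {
	if v, ok := voiceByLang[detected]; ok {
		return detected, v
	}
	return fallbackLang, voiceByLang[fallbackLang]
}

func main() {
	for _, code := range []string{"hi", "mr", "ta"} {
		lang, voice := routeVoice(code)
		fmt.Printf("detected=%s generate=%s voice=%s\n", code, lang, voice)
	}
}
```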
ElevenLabs in production
ElevenLabs gives high-quality TTS with streaming. Three patterns that mattered:
- Stream first chunk fast. The first audio bytes should arrive within 200ms of the TTS request. Configure optimize_streaming_latency=3 (highest priority on latency).
- Cache common responses. “Your fuel is at 65%” is generated thousands of times; cache the audio bytes keyed by (text, voice ID).
- Cost per minute matters. Premium voices cost more. Pick the cheapest voice that meets the quality bar; switch only if user feedback demands it.
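The response cache can be sketched as a map keyed by (text, voice ID) in front of the synthesizer. `AudioCache` and `SynthFunc` are hypothetical names; the synthesizer is injected so the sketch makes no claims about the TTS provider's client API.

```go
package main

import (
	"fmt"
	"sync"
)

// cacheKey identifies one rendered utterance: same text, same voice.
type cacheKey struct {
	Text    string
	VoiceID string
}

// SynthFunc renders text to audio bytes. In production this would call the
// TTS provider; here it is injected to keep the sketch provider-agnostic.
type SynthFunc func(text, voiceID string) ([]byte, error)

// AudioCache memoizes synthesized audio. It has no eviction; a real
// deployment would bound its size.
type AudioCache struct {
	mu    sync.Mutex
	audio map[cacheKey][]byte
	synth SynthFunc
}

// NewAudioCache wraps synth with an in-memory cache.
func NewAudioCache(synth SynthFunc) *AudioCache {
	return &AudioCache{audio: make(map[cacheKey][]byte), synth: synth}
}

// Get returns cached audio, or synthesizes and stores it on a miss.
// Two concurrent misses for the same key may both call synth; that costs
// one extra request but keeps the lock off the slow path.
func (c *AudioCache) Get(text, voiceID string) ([]byte, error) {
	key := cacheKey{Text: text, VoiceID: voiceID}

	c.mu.Lock()
	cached, ok := c.audio[key]
	c.mu.Unlock()
	if ok {
		return cached, nil
	}

	audio, err := c.synth(text, voiceID)
	if err != nil {
		return nil, err
	}

	c.mu.Lock()
	c.audio[key] = audio
	c.mu.Unlock()
	return audio, nil
}

func main() {
	calls := 0
	cache := NewAudioCache(func(text, voiceID string) ([]byte, error) {
		calls++ // stands in for a billed TTS request
		return []byte(voiceID + ":" + text), nil
	})
	for i := 0; i < 3; i++ {
		cache.Get("Your fuel is at 65%", "voice-english-female")
	}
	fmt.Println("TTS requests made:", calls)
}
```

Caching only pays off for responses whose text repeats exactly, so it helps to template numeric answers consistently (for example, always rounding fuel to a whole percent).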
What broke
- Helmet muffling. The first version’s ASR misheard most queries because the rider’s voice was 6 dB quieter than expected. Fix: per-rider voice profile calibration on first use.
- Wind noise above 60 km/h. Noise cancellation in the audio pipeline was insufficient. Fix: add a short hardware delay on the mic input to capture a noise sample, then subtract it from the signal.
- Hindi/English code-mixing. Riders say “kya hai mileage” (a mix of Hindi and English). The ASR transcribed it, but the intent classifier was confused. Fix: train the classifier on code-mixed data.
What I’d carry forward
For voice AI in operational environments (vehicles, factory floors, kitchens):
- Latency is the dominant UX axis. Tune for it before tuning for accuracy.
- Multi-language is detect-and-route, not detect-and-translate.
- Hardware integration (mic, speaker, button) dominates the “first prototype” timeline.
- ElevenLabs + Bhashini + a small cloud LLM is a credible end-to-end stack at the PoC level.
The Kinetic PoC ran for three months; the patterns above transferred cleanly to a separate fleet-operator voice deployment. The voice stack patterns are durable; the per-domain tools differ.