Karla — AI Voice Agent
Production voice AI agent for a Mexican logistics company. Karla calls (and receives calls from) truck drivers to validate cargo trip data — confirming matches, updating discrepancies, or registering new trips — and writes back to the client's operations system in real time. Speaks colloquial Mexican Spanish, keeps turns under two sentences, and warm-transfers to a human agent at any point.
The Use Case
What Karla actually does
Every cargo trip leg ends with the same task: someone has to call the driver, verify the trip data, and update the operations system. Multiply by tens of thousands of legs per month and the client was burning a contact-center team on a flow that's 90% scripted. Karla replaces that flow with three deterministic scenarios — and escalates to a human the moment things go off-script.
✓ Confirm
The driver's data matches the system. Karla confirms and closes the call.
validar_viaje
✎ Update
There's a discrepancy (origin / destination / date). Karla updates the system.
actualizar_viaje
+ Create
The trip doesn't exist in the system. Karla walks the driver through registering it.
crear_viaje
Warm transfer to a human at any moment via native RingCentral hand-off, no SIP bridge needed. Tool: transferir_a_humano.
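The four tools above could be declared to the LLM roughly as follows. This is a minimal sketch, assuming the name / description / input_schema shape that Anthropic's Messages API uses for tool definitions. The descriptions and argument names (origin, destination, date, reason) are illustrative assumptions, not the project's actual schema. trip_id is deliberately absent from every schema, since the document says it is pinned server-side.

```python
from typing import Any, Dict, List

# Hypothetical argument fields shared by the update and create tools.
TRIP_FIELDS: Dict[str, Any] = {
    "origin": {"type": "string"},
    "destination": {"type": "string"},
    "date": {"type": "string", "description": "ISO 8601 date"},
}

# Tool names come from the document; everything else is illustrative.
TOOLS: List[Dict[str, Any]] = [
    {
        "name": "validar_viaje",
        "description": "Confirm that the driver's data matches the system.",
        "input_schema": {"type": "object", "properties": {}, "required": []},
    },
    {
        "name": "actualizar_viaje",
        "description": "Update origin, destination, or date after a discrepancy.",
        "input_schema": {"type": "object", "properties": TRIP_FIELDS, "required": []},
    },
    {
        "name": "crear_viaje",
        "description": "Register a trip that does not exist in the system.",
        "input_schema": {
            "type": "object",
            "properties": TRIP_FIELDS,
            "required": ["origin", "destination", "date"],
        },
    },
    {
        "name": "transferir_a_humano",
        "description": "Warm-transfer the call to a human agent.",
        "input_schema": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": [],
        },
    },
]


def tool_names(tools: List[Dict[str, Any]]) -> List[str]:
    """Return the tool names in declaration order."""
    return [tool["name"] for tool in tools]
```

Keeping the schemas this small means the model only chooses which scenario applies and which fields changed.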
Architecture
How the pipeline is wired
A FastAPI middleware orchestrates the voice pipeline: RingCentral provides the audio leg, Deepgram transcribes the audio as a stream, Claude Haiku 4.5 chooses the action via tool calling, and ElevenLabs synthesizes the response. The actions (query / update / create trip) execute against the client's API.
Client operations system
│
│ trigger (driver_id, trip_id, phone)
▼
┌────────────────────────┐
│ Middleware (FastAPI) │
└────────────────────────┘
│
│ outbound call
▼
┌────────────────────────┐ ┌───────────────────┐
│ RingCentral telephony │ ◀─────▶ │ Driver (mobile) │
└────────────────────────┘ └───────────────────┘
│ PCM16 8kHz mono · WebSocket
▼
┌───────────────────────────────────────────────────────┐
│ Voice Pipeline │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ Deepgram │─▶│ Claude │─▶│ ElevenLabs │ │
│ │ Nova-2 (es) │ │ Haiku 4.5 │ │ Flash v2.5 │ │
│ │ STT │ │ LLM + tools │ │ TTS │ │
│ └──────────────┘ └──────────────┘ └────────────┘ │
└───────────────────────────────────────────────────────┘
│
│ GET / PUT / POST /trips
▼
┌────────────────────────┐
│ Client API │
└────────────────────────┘
| Layer | Choice | Why |
|---|---|---|
| Telephony | RingCentral | Already in production at the client's contact center. Native warm-transfer to a human agent, no SIP bridge required. |
| STT | Deepgram Nova-2 | Best WER for colloquial Mexican Spanish. Streaming over WebSocket, <200ms latency. |
| LLM | Claude Haiku 4.5 | Native tool calling, more than enough for a 3-scenario FSM. Same provider as the dev environment. |
| TTS | ElevenLabs Flash v2.5 | ~75ms TTFB, ~50% cheaper than Turbo, quality good enough for MX-Spanish contact-center voice. |
| Backend | Python 3.12 + FastAPI | asyncio for concurrent WebSockets; first-class AI SDKs. |
| Queue | ARQ + Redis | Async-native, fits the FastAPI / asyncio model. |
| DB | PostgreSQL 16 | Call logs, analytics, audit trail. |
| Infra | AWS ECS + Docker | No Kubernetes for go-live — unnecessary overhead for a 6-week project. Future migration if the business case appears. |
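
One conversational turn through the STT → LLM → TTS chain can be sketched as three awaitable stages composed in order. This is an illustrative sketch only: run_turn and the fake_* stubs are hypothetical names, and the stubs stand in for the real Deepgram, Claude, and ElevenLabs clients, which stream rather than process one buffer at a time.

```python
import asyncio
from typing import Awaitable, Callable

# Each stage is an async callable so real SDK clients could be swapped in.
Transcribe = Callable[[bytes], Awaitable[str]]
Decide = Callable[[str], Awaitable[str]]
Synthesize = Callable[[str], Awaitable[bytes]]


async def run_turn(audio: bytes, stt: Transcribe, llm: Decide, tts: Synthesize) -> bytes:
    """Run one driver utterance through STT, LLM, and TTS in sequence."""
    text = await stt(audio)
    reply = await llm(text)
    return await tts(reply)


# Stub stages (assumptions for the demo, not real provider calls).
async def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


async def fake_llm(text: str) -> str:
    return f"Entendido: {text}"


async def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    out = asyncio.run(run_turn(b"si, todo correcto", fake_stt, fake_llm, fake_tts))
    print(out.decode("utf-8"))
```

Passing the stages in as parameters is one way to let the same turn loop run against different providers.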
Modules
Backend layout
Inventory of app/ modules. Each has a clearly bounded responsibility
and a paired test suite.
STT (Deepgram Nova-2 streaming) → LLM (Claude Haiku 4.5 with 4 tools: validate / update / create / transfer) → TTS (ElevenLabs Flash v2.5).
HTTP client for the client API (GET / PUT / POST /trips). Currently runs against a local mock while the real schema from the client is pending.
CallSession with a finite-state machine (greeting → validating → updating / creating → closing). Per-call state and context held in Redis.
Outbound dispatcher + scheduler with a configurable calling window and retry logic (no answer → retry after 30 min, max 3 attempts).
Postgres repos for the calls table (full audit trail: transcripts + tool calls) and campaigns.
Bearer auth on /api/calls* and /metrics. API_TOKEN validated at startup. mypy strict in auth helpers.
Token-bucket rate limiter (Redis) + circuit breaker for the client API + phone-number validation.
Prometheus /metrics endpoint with STT/LLM/TTS latency histograms, HTTP request counters, and ToolDispatcher readiness counters.
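The scheduler's retry rule (30 minutes after a no-answer, at most 3 attempts, inside a calling window) can be sketched as a pure function. This is a hedged illustration: next_attempt_at is a hypothetical name, the 09:00 to 18:00 window is an assumed default, and naive datetimes stand in for timezone-aware ones.

```python
from datetime import datetime, time, timedelta
from typing import Optional

RETRY_DELAY = timedelta(minutes=30)  # from the text: no answer -> 30 min
MAX_ATTEMPTS = 3                     # from the text: max 3 attempts


def next_attempt_at(
    attempts_made: int,
    last_attempt: datetime,
    window_start: time = time(9, 0),   # assumed calling window
    window_end: time = time(18, 0),    # assumed calling window
) -> Optional[datetime]:
    """Return when to redial, or None once the attempt budget is spent."""
    if attempts_made >= MAX_ATTEMPTS:
        return None
    candidate = last_attempt + RETRY_DELAY
    if candidate.time() < window_start:
        # Too early: wait for today's window to open.
        return datetime.combine(candidate.date(), window_start)
    if candidate.time() >= window_end:
        # Too late: roll over to the next day's window.
        return datetime.combine(candidate.date() + timedelta(days=1), window_start)
    return candidate
```

Keeping the rule free of I/O makes it easy to unit-test separately from the queue that enqueues the actual redial.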
Security & Observability
Production posture
Hardening pass driven by a recent threat review, plus Prometheus-shaped metrics for alerts and dashboards.
Security
- Bearer auth on /api/calls* and /metrics
- trip_id pinned server-side in ToolDispatcher (prevents tool-arg injection from the LLM)
- Redis auth mandatory in both dev and prod compose
- Postgres + Redis in dev bound to 127.0.0.1 only
- Startup warning if API_TOKEN is empty; bearer parsing hardened
- mypy strict on auth helpers
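The trip_id pinning listed above can be illustrated with a small dispatcher. This is a sketch under assumptions: only the ToolDispatcher name and the pinning idea come from the document, while the constructor, the dispatch method, and the handler signature are hypothetical.

```python
from typing import Any, Callable, Dict


class ToolDispatcher:
    """Routes LLM tool calls while pinning trip_id to the dispatched call."""

    def __init__(self, trip_id: str, handlers: Dict[str, Callable[..., Any]]) -> None:
        self._trip_id = trip_id      # set by the server when the call is dispatched
        self._handlers = handlers    # tool name -> handler (hypothetical shape)

    def dispatch(self, tool_name: str, llm_args: Dict[str, Any]) -> Any:
        if tool_name not in self._handlers:
            raise ValueError(f"unknown tool: {tool_name}")
        # Drop any trip_id the model supplied; only the pinned value is used.
        safe_args = {k: v for k, v in llm_args.items() if k != "trip_id"}
        return self._handlers[tool_name](trip_id=self._trip_id, **safe_args)
```

For example, a handler registered under actualizar_viaje still receives the pinned trip even if the model's arguments name a different one:

```python
def actualizar(trip_id: str, **fields: Any) -> str:
    return trip_id

dispatcher = ToolDispatcher("T-1", {"actualizar_viaje": actualizar})
dispatcher.dispatch("actualizar_viaje", {"trip_id": "T-999", "destination": "Monterrey"})
```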
Observability
- STT / LLM / TTS latency histograms (p50, p95, p99)
- HTTP request count + latency middleware
- Readiness counters in ToolDispatcher
- Prometheus-format /metrics endpoint
- E2E smoke test verifies /metrics reflects pipeline activity
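To show what a Prometheus-format latency histogram looks like on the wire, here is a standard-library sketch that records observations into cumulative buckets and renders them in the text exposition format. LatencyHistogram, the metric name, and the bucket bounds are assumptions for illustration; a production service would more likely use an existing Prometheus client library. Percentiles such as p50, p95, and p99 are typically derived from these buckets at query time, not stored directly.

```python
from bisect import bisect_left
from typing import List, Sequence


class LatencyHistogram:
    """Minimal cumulative histogram rendered in Prometheus text format."""

    def __init__(self, name: str, buckets: Sequence[float]) -> None:
        self.name = name
        self.buckets: List[float] = sorted(buckets)
        # One slot per bucket plus a final slot for the implicit +Inf bucket.
        self.counts: List[int] = [0] * (len(self.buckets) + 1)
        self.total = 0.0

    def observe(self, seconds: float) -> None:
        # bisect_left returns the first bucket whose upper bound is >= seconds.
        self.counts[bisect_left(self.buckets, seconds)] += 1
        self.total += seconds

    def render(self) -> str:
        lines = [f"# TYPE {self.name} histogram"]
        running = 0
        for bound, count in zip(self.buckets, self.counts):
            running += count  # bucket values are cumulative
            lines.append(f'{self.name}_bucket{{le="{bound}"}} {running}')
        running += self.counts[-1]
        lines.append(f'{self.name}_bucket{{le="+Inf"}} {running}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {running}")
        return "\n".join(lines) + "\n"
```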
PoC vs Production
Same architecture, different knobs
The PoC is deployed and shows the full conversational flow against a mock of the client API — useful for letting the client experience Karla without telephony or sandbox access. The architecture is identical to production; only the model and transport configs change at the boundary, so migration is a config diff rather than a rewrite.
| Layer | PoC | Production | Reason |
|---|---|---|---|
| STT | Groq Whisper v3 Turbo | Deepgram Nova-2 | Whisper is cheap and fast for the PoC; Deepgram has better WER on MX-Spanish for production. |
| LLM | Gemini 2.5 Flash | Claude Haiku 4.5 | Aligns prod with the dev environment (Anthropic), native tool calling. |
| TTS | ElevenLabs Turbo v2.5 | ElevenLabs Flash v2.5 | Flash is ~50% cheaper at equivalent quality. |
| Transport | Browser WebSocket | RingCentral PCM16 8kHz | Production rides the call center's existing telephony. |
| API target | In-memory mock | Real client API | PoC has no external dependency. Waiting on the client sandbox. |
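
The "config diff" claim can be made concrete by modelling each environment as a profile and comparing the two. This is a sketch only: PipelineProfile and config_diff are hypothetical names, and the values are copied from the table above instead of from the project's real configuration files.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PipelineProfile:
    stt: str
    llm: str
    tts: str
    transport: str
    api_target: str


POC = PipelineProfile(
    stt="Groq Whisper v3 Turbo",
    llm="Gemini 2.5 Flash",
    tts="ElevenLabs Turbo v2.5",
    transport="Browser WebSocket",
    api_target="In-memory mock",
)

PROD = PipelineProfile(
    stt="Deepgram Nova-2",
    llm="Claude Haiku 4.5",
    tts="ElevenLabs Flash v2.5",
    transport="RingCentral PCM16 8kHz",
    api_target="Real client API",
)


def config_diff(a: PipelineProfile, b: PipelineProfile) -> Dict[str, Tuple[str, str]]:
    """Return only the fields whose values differ between two profiles."""
    a_dict, b_dict = asdict(a), asdict(b)
    return {key: (a_dict[key], b_dict[key]) for key in a_dict if a_dict[key] != b_dict[key]}
```

Every field differs between the two profiles, yet the code that consumes a profile stays the same, which is the point of the table.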
Operating Costs
Unit economics at 200K calls / month
Estimate at 200,000 calls / month (~250,000 minutes, or about 1.25 minutes per call). List pricing, no enterprise discounts. With negotiated discounts (Deepgram + ElevenLabs) the monthly total drops to roughly $3,300 to $3,500.
| Component | Monthly |
|---|---|
| Deepgram Nova-2 (STT) | ~$1,475 |
| RingCentral (incremental) | ~$950 |
| Claude Haiku 4.5 (LLM) | ~$700 |
| AWS ECS infrastructure | ~$550 |
| ElevenLabs Flash v2.5 (TTS) | ~$250 |
| Total | ~$3,925 |
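
The table's total and the implied per-call and per-minute costs can be checked with a few lines of arithmetic. The inputs are the approximate figures from the table, so the derived unit costs are approximate as well.

```python
# Approximate monthly list-price figures, copied from the table above (USD).
MONTHLY_USD = {
    "Deepgram Nova-2 (STT)": 1475,
    "RingCentral (incremental)": 950,
    "Claude Haiku 4.5 (LLM)": 700,
    "AWS ECS infrastructure": 550,
    "ElevenLabs Flash v2.5 (TTS)": 250,
}
CALLS_PER_MONTH = 200_000
MINUTES_PER_MONTH = 250_000

total = sum(MONTHLY_USD.values())
cost_per_call = total / CALLS_PER_MONTH
cost_per_minute = total / MINUTES_PER_MONTH

print(f"total: ${total:,}")                   # total: $3,925
print(f"per call: ${cost_per_call:.4f}")      # per call: $0.0196
print(f"per minute: ${cost_per_minute:.4f}")  # per minute: $0.0157
```

At these figures a call costs roughly two US cents end to end.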
Engineering Notes
Decisions worth calling out
- Tool-arg pinning. The trip_id is set by the server when the call is dispatched, not trusted from the LLM's tool args. The LLM can decide which tool to call, but not which trip to mutate. A small choice that closes a whole class of prompt-injection attacks.
- FSM in Redis. Each call is a tiny state machine (greeting → validating → updating / creating → closing). Putting state in Redis instead of in-LLM-context means a process restart mid-call doesn't lose the call: the worker reattaches and the FSM resumes.
- Architectural parity between PoC and prod. The PoC runs on cheaper providers (Groq Whisper, Gemini Flash, ElevenLabs Turbo) to keep PoC costs near zero, but the architecture is identical: only model and transport configs change at the boundary. Migration to prod is a config diff, not a rewrite.
- Warm transfer over SIP bridging. A native RingCentral transfer to a human agent is one API call. Building SIP bridging would have cost a week and added a failure mode. When the existing telephony already does it, use that.
- No Kubernetes for go-live. ECS + Docker is enough for 6 weeks to production. The cost of platform sophistication ahead of demand is invisible until you're debugging an autoscaler at 2 AM instead of tuning prompts.
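The "FSM in Redis" note can be sketched as a session whose state lives in an external key-value store, so a new worker can pick the call back up. This is an illustration under assumptions: a plain dict stands in for Redis, the transition table (including validating → closing for the confirm scenario) is inferred from the state list, and the method names are hypothetical.

```python
import json
from enum import Enum
from typing import Dict, Set


class State(str, Enum):
    GREETING = "greeting"
    VALIDATING = "validating"
    UPDATING = "updating"
    CREATING = "creating"
    CLOSING = "closing"


# Assumed transition table; the document lists only the states and their order.
TRANSITIONS: Dict[State, Set[State]] = {
    State.GREETING: {State.VALIDATING},
    State.VALIDATING: {State.UPDATING, State.CREATING, State.CLOSING},
    State.UPDATING: {State.CLOSING},
    State.CREATING: {State.CLOSING},
    State.CLOSING: set(),
}


class CallSession:
    """Per-call FSM whose state is persisted outside the worker process."""

    def __init__(self, call_id: str, store: Dict[str, str]) -> None:
        self.call_id = call_id
        self.store = store  # dict stand-in for a Redis client
        if call_id not in store:
            self._save(State.GREETING)

    @property
    def state(self) -> State:
        return State(json.loads(self.store[self.call_id])["state"])

    def advance(self, target: State) -> None:
        current = self.state
        if target not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition: {current.value} -> {target.value}")
        self._save(target)

    def _save(self, state: State) -> None:
        self.store[self.call_id] = json.dumps({"state": state.value})
```

Constructing a second CallSession with the same call_id and store simulates a worker restart: it resumes from the persisted state and does not start again at the greeting.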