Project · Technical Deep Dive

Karla — AI Voice Agent

Production voice AI agent for a Mexican logistics company. Karla calls (and receives calls from) truck drivers to validate cargo trip data: confirming matches, updating discrepancies, or registering new trips. Results are written back to the client's operations system in real time. Karla speaks colloquial Mexican Spanish, keeps turns under two sentences, and can warm-transfer to a human agent at any point.

200K calls / month · ~$0.020 cost / call · <1.5s pipeline latency · 3 scenarios + escalation

What Karla actually does

Every cargo trip leg ends with the same task: someone has to call the driver, verify the trip data, and update the operations system. Multiply by tens of thousands of legs per month and the client was burning a contact-center team on a flow that's 90% scripted. Karla replaces that flow with three deterministic scenarios — and escalates to a human the moment things go off-script.

✓ Confirm

The driver's data matches the system. Karla confirms and closes the call.

Tool: validar_viaje

✎ Update

There's a discrepancy (origin / destination / date). Karla updates the system.

Tool: actualizar_viaje

+ Create

The trip doesn't exist in the system. Karla walks the driver through registering it.

Tool: crear_viaje

Warm transfer to a human at any moment — native RingCentral hand-off, no SIP bridge needed. Tool: transferir_a_humano.
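
As a rough illustration of how the four tools could be declared for tool calling, the sketch below uses the name / description / input_schema shape that Anthropic's tool-use API expects. Only the tool names come from this page; the descriptions and every field inside input_schema (origen, destino, fecha, motivo) are assumptions, not the project's real schema. Note that trip_id is deliberately absent, since it is pinned server-side.

```python
# Hypothetical tool declarations. Tool names are from the project;
# all argument fields below are assumed for illustration.
TRIP_FIELDS = {
    "origen": {"type": "string", "description": "Origin city"},
    "destino": {"type": "string", "description": "Destination city"},
    "fecha": {"type": "string", "description": "Trip date, ISO 8601"},
}

TOOLS = [
    {
        "name": "validar_viaje",
        "description": "Confirm that the driver's data matches the system.",
        "input_schema": {"type": "object", "properties": {}, "required": []},
    },
    {
        "name": "actualizar_viaje",
        "description": "Update origin, destination, or date after a discrepancy.",
        # Any subset of fields may change, so none is required.
        "input_schema": {"type": "object", "properties": TRIP_FIELDS, "required": []},
    },
    {
        "name": "crear_viaje",
        "description": "Register a trip that does not exist in the system.",
        # A new trip needs all three fields.
        "input_schema": {
            "type": "object",
            "properties": TRIP_FIELDS,
            "required": ["origen", "destino", "fecha"],
        },
    },
    {
        "name": "transferir_a_humano",
        "description": "Warm-transfer the call to a human agent.",
        "input_schema": {
            "type": "object",
            "properties": {"motivo": {"type": "string"}},
            "required": [],
        },
    },
]
```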

How the pipeline is wired

A FastAPI middleware orchestrates the voice pipeline: RingCentral provides the audio leg, Deepgram transcribes the audio stream, Claude Haiku 4.5 decides the action via tool calling, and ElevenLabs synthesizes the response. The actions (consult / update / create trip) execute against the client's API.

  Client operations system
         │
         │  trigger (driver_id, trip_id, phone)
         ▼
  ┌────────────────────────┐
  │  Middleware (FastAPI)  │
  └────────────────────────┘
         │
         │  outbound call
         ▼
  ┌────────────────────────┐         ┌───────────────────┐
  │  RingCentral telephony │ ◀─────▶ │   Driver (mobile) │
  └────────────────────────┘         └───────────────────┘
         │  PCM16 8kHz mono · WebSocket
         ▼
  ┌───────────────────────────────────────────────────────┐
  │  Voice Pipeline                                       │
  │  ┌──────────────┐  ┌──────────────┐  ┌────────────┐  │
  │  │ Deepgram     │─▶│ Claude       │─▶│ ElevenLabs │  │
  │  │ Nova-2 (es)  │  │ Haiku 4.5    │  │ Flash v2.5 │  │
  │  │ STT          │  │ LLM + tools  │  │ TTS        │  │
  │  └──────────────┘  └──────────────┘  └────────────┘  │
  └───────────────────────────────────────────────────────┘
         │
         │  GET / PUT / POST /trips
         ▼
  ┌────────────────────────┐
  │   Client API           │
  └────────────────────────┘

| Layer | Choice | Why |
|---|---|---|
| Telephony | RingCentral | Already in production at the client's contact center. Native warm-transfer to a human agent, no SIP bridge required. |
| STT | Deepgram Nova-2 | Best WER for colloquial Mexican Spanish. Streaming over WebSocket, <200ms latency. |
| LLM | Claude Haiku 4.5 | Native tool calling, more than enough for a 3-scenario FSM. Same provider as the dev environment. |
| TTS | ElevenLabs Flash v2.5 | ~75ms TTFB, ~50% cheaper than Turbo, quality good enough for MX-Spanish contact-center voice. |
| Backend | Python 3.12 + FastAPI | asyncio for concurrent WebSockets; first-class AI SDKs. |
| Queue | ARQ + Redis | Async-native, fits the FastAPI / asyncio model. |
| DB | PostgreSQL 16 | Call logs, analytics, audit trail. |
| Infra | AWS ECS + Docker | No Kubernetes for go-live: unnecessary overhead for a 6-week project. Future migration if the business case appears. |
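
The STT → LLM → TTS hand-off can be sketched as one conversational turn with per-stage timing. This is a simplified, sequential illustration: the real pipeline streams audio over WebSockets, and the stage functions here (fake_stt, fake_llm, fake_tts) are stand-in stubs, not Deepgram, Anthropic, or ElevenLabs client calls.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Tuple


async def timed(name: str, timings: Dict[str, float], awaitable: Awaitable):
    """Await one stage and record how long it took, in seconds."""
    start = time.perf_counter()
    result = await awaitable
    timings[name] = time.perf_counter() - start
    return result


async def run_turn(
    audio: bytes,
    stt: Callable[[bytes], Awaitable[str]],
    llm: Callable[[str], Awaitable[str]],
    tts: Callable[[str], Awaitable[bytes]],
) -> Tuple[bytes, Dict[str, float]]:
    """Run one driver utterance through the three stages in order."""
    timings: Dict[str, float] = {}
    transcript = await timed("stt", timings, stt(audio))
    reply = await timed("llm", timings, llm(transcript))
    speech = await timed("tts", timings, tts(reply))
    return speech, timings


# Stub stages (assumptions) so the sketch runs without network access.
async def fake_stt(audio: bytes) -> str:
    return "si, todo correcto"


async def fake_llm(transcript: str) -> str:
    return "Perfecto, gracias."


async def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


speech, timings = asyncio.run(run_turn(b"\x00\x00", fake_stt, fake_llm, fake_tts))
```

The timings dict is the kind of data that would feed the per-stage latency histograms; summing it gives a rough view of where the <1.5s budget goes.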

Backend layout

Inventory of app/ modules. Each has a clearly bounded responsibility and a paired test suite.

pipeline/ Implemented

STT (Deepgram Nova-2 streaming) → LLM (Claude Haiku 4.5 with 4 tools: validate / update / create / transfer) → TTS (ElevenLabs Flash v2.5).

integrations/ Mock active

HTTP client for the client API (GET / PUT / POST /trips). Today against a local mock; waiting on the real schema from the client.

session/ Implemented

CallSession with a finite-state machine (greeting → validating → updating / creating → closing). Per-call state and context held in Redis.
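
A minimal sketch of such a state machine, assuming a JSON-serialized session so it can be stored under a per-call key. The state names follow the flow above; the validating → closing edge (the plain "confirm" scenario), the field names, and the in-memory dict standing in for Redis are assumptions. Escalation to a human is not modelled here.

```python
import json
from dataclasses import asdict, dataclass, field

# Allowed transitions. validating -> closing is assumed for the confirm case.
TRANSITIONS = {
    "greeting": {"validating"},
    "validating": {"updating", "creating", "closing"},
    "updating": {"closing"},
    "creating": {"closing"},
    "closing": set(),
}


@dataclass
class CallSession:
    call_id: str
    trip_id: str
    state: str = "greeting"
    context: dict = field(default_factory=dict)

    def advance(self, new_state: str) -> None:
        """Move to new_state, rejecting edges the FSM does not allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "CallSession":
        return cls(**json.loads(raw))


# Stand-in for Redis: a plain dict keyed by call id.
store = {}
session = CallSession(call_id="c-1", trip_id="T-100")
session.advance("validating")
store["call:c-1"] = session.to_json()

# A restarted worker could rebuild the session from the stored JSON.
resumed = CallSession.from_json(store["call:c-1"])
```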

campaign/ Implemented

Outbound dispatcher + scheduler with a configurable calling window and retry logic (no answer → 30 min, max 3 attempts).
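
The retry rule (no answer → 30 min, max 3 attempts) combined with a calling window can be sketched as a pure function. The 08:00 to 20:00 window and the function name are assumptions; the delay and attempt cap come from the description above. Datetimes are naive here, so time-zone handling is out of scope.

```python
from datetime import datetime, time, timedelta
from typing import Optional

RETRY_DELAY = timedelta(minutes=30)   # from the retry rule above
MAX_ATTEMPTS = 3                      # from the retry rule above
WINDOW_START = time(8, 0)             # assumed calling window
WINDOW_END = time(20, 0)              # assumed calling window


def next_attempt_at(attempts_made: int, last_attempt: datetime) -> Optional[datetime]:
    """Return when to redial after a no-answer, or None if attempts are exhausted."""
    if attempts_made >= MAX_ATTEMPTS:
        return None
    candidate = last_attempt + RETRY_DELAY
    if candidate.time() < WINDOW_START:
        # Too early: wait for the window to open the same day.
        return datetime.combine(candidate.date(), WINDOW_START)
    if candidate.time() >= WINDOW_END:
        # Too late: push to the start of the next day's window.
        return datetime.combine(candidate.date() + timedelta(days=1), WINDOW_START)
    return candidate
```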

persistence/ Implemented

Postgres repos for the calls table (full audit trail: transcripts + tool calls) and campaigns.

security/ Implemented

Bearer auth on /api/calls* and /metrics. API_TOKEN checked at startup (warning if empty). mypy strict in auth helpers.

hardening/ Implemented

Token-bucket rate limiter (Redis) + circuit breaker for the client API + phone-number validation.
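
For readers unfamiliar with the token-bucket technique, here is a single-process sketch. The project's limiter keeps its state in Redis; this in-memory version with an injectable clock only illustrates the refill-and-spend logic, and the class and parameter names are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity          # start full
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Injecting the clock keeps the limiter deterministic under test; a Redis-backed version would need the refill and spend to happen atomically, for example in a Lua script.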

observability/ Implemented

Prometheus /metrics endpoint with STT/LLM/TTS latency histograms, HTTP request counters, and ToolDispatcher readiness counters.
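
To show what a latency histogram looks like on the wire, the following standard-library sketch accumulates observations into cumulative buckets and renders them in the Prometheus text exposition format. It is an illustration only: a production service would normally rely on a Prometheus client library, and the metric name and bucket bounds here are assumptions.

```python
import bisect
from typing import List


class LatencyHistogram:
    def __init__(self, name: str, buckets: List[float]) -> None:
        self.name = name
        self.bounds = sorted(buckets)
        # One slot per bound plus a final slot for +Inf.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, seconds: float) -> None:
        # bisect_left picks the first bound >= seconds (the "le" bucket).
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total += seconds

    def render(self) -> str:
        lines = [f"# TYPE {self.name} histogram"]
        cumulative = 0
        labels = [repr(b) for b in self.bounds] + ["+Inf"]
        for label, count in zip(labels, self.counts):
            cumulative += count  # Prometheus buckets are cumulative
            lines.append(f'{self.name}_bucket{{le="{label}"}} {cumulative}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {cumulative}")
        return "\n".join(lines)


stt_latency = LatencyHistogram("stt_latency_seconds", [0.1, 0.5])
for sample in (0.05, 0.3, 2.0):
    stt_latency.observe(sample)
exposition = stt_latency.render()
```

Percentiles such as p95 are then computed on the Prometheus side from the bucket counters rather than inside the service.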

Production posture

Hardening pass driven by a recent threat review, plus Prometheus-shaped metrics for alerts and dashboards.

Security

  • Bearer auth on /api/calls* and /metrics
  • trip_id pinned server-side in ToolDispatcher (prevents tool-arg injection from the LLM)
  • Redis auth mandatory in both dev and prod compose
  • Postgres + Redis in dev bound to 127.0.0.1 only
  • Startup warning if API_TOKEN is empty; bearer parsing hardened
  • mypy strict on auth helpers
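
One way hardened bearer parsing might look, as a framework-free sketch: strict two-part header parsing, a constant-time comparison, and an empty API_TOKEN that never authorizes anyone. The function names are assumptions, not the project's actual helpers.

```python
import hmac
from typing import Optional


def extract_bearer(header: Optional[str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' value, else None."""
    if not header:
        return None
    parts = header.split()
    # Exactly two parts, scheme matched case-insensitively.
    if len(parts) != 2 or parts[0].lower() != "bearer":
        return None
    return parts[1]


def is_authorized(header: Optional[str], expected_token: str) -> bool:
    token = extract_bearer(header)
    # An unset or empty API_TOKEN must fail closed, not match an empty header.
    if not expected_token or token is None:
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(token.encode("utf-8"), expected_token.encode("utf-8"))
```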

Observability

  • STT / LLM / TTS latency histograms (p50, p95, p99)
  • HTTP request count + latency middleware
  • Readiness counters in ToolDispatcher
  • Prometheus-format /metrics endpoint
  • E2E smoke test verifies /metrics reflects pipeline activity

Same architecture, different knobs

The PoC is deployed and shows the full conversational flow against a mock of the client API — useful for letting the client experience Karla without telephony or sandbox access. The architecture is identical to production; only the model and transport configs change at the boundary, so migration is a config diff rather than a rewrite.

| Layer | PoC | Production | Reason |
|---|---|---|---|
| STT | Groq Whisper v3 Turbo | Deepgram Nova-2 | Whisper is cheap and fast for the PoC; Deepgram has better WER on MX-Spanish for production. |
| LLM | Gemini 2.5 Flash | Claude Haiku 4.5 | Aligns prod with the dev environment (Anthropic), native tool calling. |
| TTS | ElevenLabs Turbo v2.5 | ElevenLabs Flash v2.5 | Flash is ~50% cheaper at equivalent quality. |
| Transport | Browser WebSocket | RingCentral PCM16 8kHz | Production rides the call center's existing telephony. |
| API target | In-memory mock | Real client API | PoC has no external dependency. Waiting on the client sandbox. |
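
The "config diff" idea can be illustrated with two profiles of one frozen dataclass. The identifier strings and field names below are assumptions chosen to mirror the table; the project's real configuration keys are not shown on this page.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PipelineConfig:
    stt: str
    llm: str
    tts: str
    transport: str
    api_target: str


# Hypothetical profile values mirroring the PoC / Production columns.
PROFILES = {
    "poc": PipelineConfig(
        stt="groq-whisper-v3-turbo",
        llm="gemini-2.5-flash",
        tts="elevenlabs-turbo-v2.5",
        transport="browser-websocket",
        api_target="in-memory-mock",
    ),
    "prod": PipelineConfig(
        stt="deepgram-nova-2",
        llm="claude-haiku-4.5",
        tts="elevenlabs-flash-v2.5",
        transport="ringcentral-pcm16-8khz",
        api_target="client-api",
    ),
}


def config_diff(a: PipelineConfig, b: PipelineConfig) -> Dict[str, Tuple[str, str]]:
    """Return only the fields whose values differ between two profiles."""
    left, right = asdict(a), asdict(b)
    return {key: (left[key], right[key]) for key in left if left[key] != right[key]}


migration = config_diff(PROFILES["poc"], PROFILES["prod"])
```

Because both profiles share one type, the pipeline code that consumes PipelineConfig does not change between environments; only the values do.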

Unit economics at 200K calls / month

Estimate at 200,000 calls / month (~250,000 minutes, or about 1.25 minutes per call). Sticker pricing, no enterprise discounts. If discounts are negotiated (Deepgram + ElevenLabs), the monthly total drops to roughly $3,300 to $3,500.

| Component | Monthly |
|---|---|
| Deepgram Nova-2 (STT) | ~$1,475 |
| RingCentral (incremental) | ~$950 |
| Claude Haiku 4.5 (LLM) | ~$700 |
| AWS ECS infrastructure | ~$550 |
| ElevenLabs Flash v2.5 (TTS) | ~$250 |
| Total | ~$3,925 |

~$3,925 per month · ~$0.020 per call · ~250K minutes / month
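
The headline figures follow directly from the component estimates. The short calculation below reproduces them from the monthly costs, call volume, and minute volume quoted in this section; the variable names are just for illustration.

```python
# Monthly sticker-price estimates in USD, as listed in this section.
MONTHLY_COSTS_USD = {
    "Deepgram Nova-2 (STT)": 1475,
    "RingCentral (incremental)": 950,
    "Claude Haiku 4.5 (LLM)": 700,
    "AWS ECS infrastructure": 550,
    "ElevenLabs Flash v2.5 (TTS)": 250,
}
CALLS_PER_MONTH = 200_000
MINUTES_PER_MONTH = 250_000

total = sum(MONTHLY_COSTS_USD.values())
cost_per_call = total / CALLS_PER_MONTH
cost_per_minute = total / MINUTES_PER_MONTH
avg_call_minutes = MINUTES_PER_MONTH / CALLS_PER_MONTH

print(f"total: ${total:,} / month")
print(f"per call: ${cost_per_call:.4f}")
print(f"per minute: ${cost_per_minute:.4f}")
print(f"average call length: {avg_call_minutes:.2f} min")
```

STT is the largest single line item at roughly 38% of the total, which is why a Deepgram discount moves the monthly figure the most.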

Decisions worth calling out

  • Tool-arg pinning. The trip_id is set by the server when the call is dispatched, not trusted from the LLM's tool args. The LLM can decide which tool to call, but not which trip to mutate. A small choice that closes a whole class of prompt-injection attacks.
  • FSM in Redis. Each call is a tiny state machine (greeting → validating → updating / creating → closing). Putting state in Redis instead of in-LLM-context means a process restart mid-call doesn't lose the call — the worker reattaches and the FSM resumes.
  • Architecture parity between PoC and prod. The PoC runs on a different provider set (Groq Whisper, Gemini Flash, ElevenLabs Turbo), chosen mainly to keep PoC costs near zero. The architecture is identical; only model and transport configs change at the boundary, so migration to prod is a config diff, not a rewrite.
  • Warm transfer over SIP bridging. A native RingCentral transfer to a human agent is one API call. Building SIP bridging would have cost a week and added a failure mode. When the existing telephony already does it, use that.
  • No Kubernetes for go-live. ECS + Docker is enough for 6 weeks to production. The cost of platform sophistication ahead of demand is invisible until you're debugging an autoscaler at 2 AM instead of tuning prompts.
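
The tool-arg pinning decision from the list above can be sketched as follows. This is a minimal illustration under assumptions: the real ToolDispatcher's interface is not shown on this page, so the constructor, the dispatch signature, and the handler shape are hypothetical. The point is only that trip_id is bound when the call is dispatched and any trip_id arriving in the LLM's arguments is discarded.

```python
from typing import Any, Callable, Dict


class ToolDispatcher:
    """Route LLM tool calls to handlers with a server-pinned trip_id."""

    def __init__(self, trip_id: str, handlers: Dict[str, Callable[..., Any]]) -> None:
        self._trip_id = trip_id      # fixed when the call is dispatched
        self._handlers = handlers

    def dispatch(self, tool_name: str, llm_args: Dict[str, Any]) -> Any:
        if tool_name not in self._handlers:
            raise KeyError(f"unknown tool: {tool_name}")
        # Drop any trip_id the model supplied; the pinned value always wins.
        safe_args = {k: v for k, v in llm_args.items() if k != "trip_id"}
        return self._handlers[tool_name](trip_id=self._trip_id, **safe_args)


# Hypothetical handler that just echoes what it would send to the client API.
def actualizar_viaje(trip_id: str, **changes: Any) -> Dict[str, Any]:
    return {"trip_id": trip_id, "changes": changes}


dispatcher = ToolDispatcher("T-100", {"actualizar_viaje": actualizar_viaje})
result = dispatcher.dispatch(
    "actualizar_viaje",
    {"trip_id": "T-999", "destino": "Nuevo Laredo"},  # injected trip_id is ignored
)
```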