Project · Technical Deep Dive

Karla — AI Voice Agent

Production voice AI agent for a Mexican logistics company. Karla calls truck drivers, and takes calls from them, to validate cargo trip data: confirming matches, updating discrepancies, or registering new trips. The result is written back to the client's operations system in real time. Karla speaks colloquial Mexican Spanish, keeps each turn under two sentences, and can warm-transfer to a human agent at any point.

200K

Calls / month

~$0.020

Cost / call

<1.5s

Pipeline latency

3

Scenarios + escalation


The Use Case

What Karla actually does

Every cargo trip leg ends with the same task: someone has to call the driver, verify the trip data, and update the operations system. Multiply by tens of thousands of legs per month and the client was burning a contact-center team on a flow that's 90% scripted. Karla replaces that flow with three deterministic scenarios — and escalates to a human the moment things go off-script.

✓ Confirm

The driver's data matches the system. Karla confirms and closes the call.

validar_viaje

✎ Update

There's a discrepancy (origin / destination / date). Karla updates the system.

actualizar_viaje

＋ Create

The trip doesn't exist in the system. Karla walks the driver through registering it.

crear_viaje

Warm transfer to a human at any moment — native RingCentral hand-off, no SIP bridge needed. Tool: transferir_a_humano.
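The four tools above could be declared to the LLM roughly as follows. This is a minimal sketch in the name / description / input_schema shape that Anthropic's tool calling accepts. The field names (origen, destino, fecha, motivo) and the make_tool helper are hypothetical, since the real client schema is not shown here. trip_id is deliberately absent from every schema, on the assumption that it is supplied server-side.

```python
from typing import Any

# Hypothetical trip fields; the real client schema is not described in this write-up.
TRIP_FIELDS: dict[str, Any] = {
    "origen": {"type": "string", "description": "Origin city"},
    "destino": {"type": "string", "description": "Destination city"},
    "fecha": {"type": "string", "description": "Trip date (YYYY-MM-DD)"},
}


def make_tool(
    name: str, description: str, properties: dict[str, Any], required: list[str]
) -> dict[str, Any]:
    """Build one tool definition in the name / description / input_schema shape."""
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


TOOLS = [
    make_tool("validar_viaje", "Confirm that the driver's data matches the system.", {}, []),
    make_tool("actualizar_viaje", "Update origin, destination, or date after a discrepancy.", TRIP_FIELDS, []),
    make_tool("crear_viaje", "Register a trip that does not exist yet.", TRIP_FIELDS, list(TRIP_FIELDS)),
    make_tool(
        "transferir_a_humano",
        "Hand the call to a human agent.",
        {"motivo": {"type": "string", "description": "Reason for the transfer"}},
        ["motivo"],
    ),
]
```

Keeping the list this small is part of the design: with only four tools, each mapped to one scenario or to escalation, the model has little room to improvise.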

Architecture

How the pipeline is wired

A FastAPI middleware orchestrates the voice pipeline: RingCentral provides the audio leg, Deepgram transcribes the audio as a stream, Claude Haiku 4.5 decides the action via tool calling, and ElevenLabs synthesizes the response. The resulting actions (look up / update / create trip) execute against the client's API.

  Client operations system
         │
         │  trigger (driver_id, trip_id, phone)
         ▼
  ┌────────────────────────┐
  │  Middleware (FastAPI)  │
  └────────────────────────┘
         │
         │  outbound call
         ▼
  ┌────────────────────────┐         ┌───────────────────┐
  │  RingCentral telephony │ ◀─────▶ │   Driver (mobile) │
  └────────────────────────┘         └───────────────────┘
         │  PCM16 8kHz mono · WebSocket
         ▼
  ┌───────────────────────────────────────────────────────┐
  │  Voice Pipeline                                       │
  │  ┌──────────────┐  ┌──────────────┐  ┌────────────┐  │
  │  │ Deepgram     │─▶│ Claude       │─▶│ ElevenLabs │  │
  │  │ Nova-2 (es)  │  │ Haiku 4.5    │  │ Flash v2.5 │  │
  │  │ STT          │  │ LLM + tools  │  │ TTS        │  │
  │  └──────────────┘  └──────────────┘  └────────────┘  │
  └───────────────────────────────────────────────────────┘
         │
         │  GET / PUT / POST /trips
         ▼
  ┌────────────────────────┐
  │   Client API           │
  └────────────────────────┘
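
The middle box of the diagram can be sketched as a single conversational turn. This is an illustration, not the project's code: the Decision type, the stage signatures, and the fake_* stubs are all hypothetical, and the stages run sequentially on a whole utterance for clarity, whereas the described system streams audio over WebSockets.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class Decision:
    reply_text: str            # what Karla says next
    tool_name: Optional[str]   # tool the LLM chose, if any


# Hypothetical stage signatures; real ones would wrap the vendor SDKs.
Transcribe = Callable[[bytes], Awaitable[str]]
Decide = Callable[[str], Awaitable[Decision]]
Synthesize = Callable[[str], Awaitable[bytes]]


async def run_turn(
    audio: bytes, stt: Transcribe, llm: Decide, tts: Synthesize
) -> tuple[Decision, bytes]:
    """Run one turn: driver audio -> transcript -> decision -> reply audio."""
    transcript = await stt(audio)
    decision = await llm(transcript)
    reply_audio = await tts(decision.reply_text)
    return decision, reply_audio


# Stub stages so the sketch runs without any vendor account.
async def fake_stt(audio: bytes) -> str:
    return "sí, todo correcto"


async def fake_llm(transcript: str) -> Decision:
    tool = "validar_viaje" if "correcto" in transcript else None
    return Decision(reply_text="Perfecto, queda confirmado.", tool_name=tool)


async def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder for PCM audio


decision, reply_audio = asyncio.run(run_turn(b"\x00\x00", fake_stt, fake_llm, fake_tts))
```

Passing the stages in as callables is one way to keep the orchestration independent of any vendor, which is what lets a PoC and a production build share the same wiring.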

Layer	Choice	Why
Telephony	RingCentral	Already in production at the client's contact center. Native warm-transfer to a human agent, no SIP bridge required.
STT	Deepgram Nova-2	Best WER for colloquial Mexican Spanish. Streaming over WebSocket, <200ms latency.
LLM	Claude Haiku 4.5	Native tool calling, more than enough for a 3-scenario FSM. Same provider as the dev environment.
TTS	ElevenLabs Flash v2.5	~75ms TTFB, ~50% cheaper than Turbo, quality good enough for MX-Spanish contact-center voice.
Backend	Python 3.12 + FastAPI	asyncio for concurrent WebSockets; first-class AI SDKs.
Queue	ARQ + Redis	Async-native, fits the FastAPI / asyncio model.
DB	PostgreSQL 16	Call logs, analytics, audit trail.
Infra	AWS ECS + Docker	No Kubernetes for go-live: unnecessary overhead for a 6-week project. Migration is possible later if a business case appears.

Modules

Backend layout

Inventory of app/ modules. Each has a clearly bounded responsibility and a paired test suite.

pipeline/ Implemented

STT (Deepgram Nova-2 streaming) → LLM (Claude Haiku 4.5 with 4 tools: validate / update / create / transfer) → TTS (ElevenLabs Flash v2.5).

integrations/ Mock active

HTTP client for the client API (GET / PUT / POST /trips). Currently runs against a local mock while the real schema from the client is pending.

session/ Implemented

CallSession with a finite-state machine (greeting → validating → updating / creating → closing). Per-call state and context held in Redis.
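
The state machine above might look like the following in-memory sketch. The state names come from the arrow sequence. The direct validating → closing edge is an assumption (it models the Confirm scenario, where nothing needs updating), and Redis persistence is left out entirely.

```python
from enum import Enum


class CallState(str, Enum):
    GREETING = "greeting"
    VALIDATING = "validating"
    UPDATING = "updating"
    CREATING = "creating"
    CLOSING = "closing"


# Allowed transitions. validating -> closing is assumed for the Confirm scenario.
TRANSITIONS: dict[CallState, frozenset[CallState]] = {
    CallState.GREETING: frozenset({CallState.VALIDATING}),
    CallState.VALIDATING: frozenset(
        {CallState.UPDATING, CallState.CREATING, CallState.CLOSING}
    ),
    CallState.UPDATING: frozenset({CallState.CLOSING}),
    CallState.CREATING: frozenset({CallState.CLOSING}),
    CallState.CLOSING: frozenset(),
}


class InvalidTransition(Exception):
    """Raised when a call tries to move to a state the FSM does not allow."""


class CallSession:
    def __init__(self, call_id: str) -> None:
        self.call_id = call_id
        self.state = CallState.GREETING

    def advance(self, new_state: CallState) -> None:
        """Move to new_state, or raise if the edge is not in TRANSITIONS."""
        if new_state not in TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state.value} -> {new_state.value}")
        self.state = new_state
```

An explicit transition table means an unexpected jump fails loudly instead of silently corrupting the call state.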

campaign/ Implemented

Outbound dispatcher + scheduler with a configurable calling window and retry logic (no answer → 30 min, max 3 attempts).
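
The retry rule above can be sketched as a small scheduling function. The 30-minute delay and the 3-attempt cap come from the description. The 09:00 to 18:00 window, the function names, and the reading of "attempts" as calls already placed are assumptions. Time zones are ignored for brevity.

```python
from datetime import datetime, time, timedelta
from typing import Optional

RETRY_DELAY = timedelta(minutes=30)  # no answer -> retry in 30 min
MAX_ATTEMPTS = 3                     # give up after three calls
WINDOW_START = time(9, 0)            # hypothetical calling window
WINDOW_END = time(18, 0)             # hypothetical calling window


def clamp_to_window(when: datetime) -> datetime:
    """Push a timestamp forward to the next moment inside the calling window."""
    if when.time() < WINDOW_START:
        return datetime.combine(when.date(), WINDOW_START)
    if when.time() >= WINDOW_END:
        return datetime.combine(when.date() + timedelta(days=1), WINDOW_START)
    return when


def next_attempt_at(attempts_made: int, last_attempt: datetime) -> Optional[datetime]:
    """Return when to redial after a no-answer, or None once attempts run out."""
    if attempts_made >= MAX_ATTEMPTS:
        return None
    return clamp_to_window(last_attempt + RETRY_DELAY)
```

For example, a first no-answer at 17:45 would be rescheduled to 09:00 the next day under these assumed window bounds, because 18:15 falls outside the window.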

persistence/ Implemented

Postgres repos for the calls table (full audit trail: transcripts + tool calls) and campaigns.

security/ Implemented

Bearer auth on /api/calls* and /metrics. API_TOKEN validated at startup. mypy strict in auth helpers.

hardening/ Implemented

Token-bucket rate limiter (Redis) + circuit breaker for the client API + phone-number validation.
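
For readers unfamiliar with the token-bucket technique, here is a minimal single-process sketch. It is not the Redis-backed limiter described above: the state lives in memory, and the class and parameter names are illustrative. The clock is injectable so the behavior can be checked deterministically.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With rate=1.0 and capacity=2, two requests pass immediately, a third is rejected, and one more passes after a second has elapsed. A shared store such as Redis becomes necessary once several workers have to enforce the same limit.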

observability/ Implemented

Prometheus /metrics endpoint with STT/LLM/TTS latency histograms, HTTP request counters, and ToolDispatcher readiness counters.

Security & Observability

Production posture

A hardening pass driven by a recent threat review, plus Prometheus-format metrics for alerts and dashboards.

Security

Bearer auth on /api/calls* and /metrics
trip_id pinned server-side in ToolDispatcher (prevents tool-arg injection from the LLM)
Redis auth mandatory in both dev and prod compose
Postgres + Redis in dev bound to 127.0.0.1 only
Startup warning if API_TOKEN is empty; bearer parsing hardened
mypy strict on auth helpers
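
The trip_id pinning in the list above is worth illustrating, since it is the control that stops a manipulated model output from touching another trip. The sketch below reuses the ToolDispatcher name from the list, but its constructor, the dispatch method, and the handler signature are assumptions.

```python
from typing import Any, Callable

Handler = Callable[[dict[str, Any]], dict[str, Any]]


class ToolDispatcher:
    """Routes LLM tool calls, forcing trip_id to the value fixed at call start."""

    def __init__(self, trip_id: str, handlers: dict[str, Handler]) -> None:
        self._trip_id = trip_id
        self._handlers = handlers

    def dispatch(self, tool_name: str, llm_args: dict[str, Any]) -> dict[str, Any]:
        if tool_name not in self._handlers:
            raise ValueError(f"unknown tool: {tool_name}")
        # Whatever the model put in trip_id is discarded and overwritten.
        safe_args = {**llm_args, "trip_id": self._trip_id}
        return self._handlers[tool_name](safe_args)
```

In this sketch, a call such as dispatch("actualizar_viaje", {"trip_id": "T-999", "destino": "Monterrey"}) on a dispatcher created for trip T-100 reaches the handler with trip_id set to T-100.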

Observability

STT / LLM / TTS latency histograms (p50, p95, p99)
HTTP request count + latency middleware
Readiness counters in ToolDispatcher
Prometheus-format /metrics endpoint
E2E smoke test verifies /metrics reflects pipeline activity
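
To show what a latency histogram looks like on a Prometheus-format endpoint, here is a stdlib-only stand-in. A real service would more likely use an existing client library. The class, the bucket bounds, and the metric name stt_latency_seconds are illustrative. Percentiles such as p95 are not stored: they are estimated from the cumulative buckets at query time.

```python
from bisect import bisect_left


class LatencyHistogram:
    """Tiny stand-in for a Prometheus histogram: cumulative buckets, sum, count."""

    def __init__(self, name: str, buckets: list[float]) -> None:
        self.name = name
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot is +Inf
        self.total = 0.0

    def observe(self, seconds: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= the sample.
        self.counts[bisect_left(self.buckets, seconds)] += 1
        self.total += seconds

    def render(self) -> str:
        """Render in the Prometheus text format (buckets are cumulative)."""
        lines = []
        running = 0
        bounds = [str(b) for b in self.buckets] + ["+Inf"]
        for bound, n in zip(bounds, self.counts):
            running += n
            lines.append(f'{self.name}_bucket{{le="{bound}"}} {running}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {running}")
        return "\n".join(lines)


stt_latency = LatencyHistogram("stt_latency_seconds", [0.1, 0.5])
for sample in (0.1, 0.3, 2.0):
    stt_latency.observe(sample)
```

A smoke test in the spirit of the last bullet could then assert that the count line of the rendered output increases after a pipeline run.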

PoC vs Production

Same architecture, different knobs

The PoC is deployed and shows the full conversational flow against a mock of the client API — useful for letting the client experience Karla without telephony or sandbox access. The architecture is identical to production; only the model and transport configs change at the boundary, so migration is a config diff rather than a rewrite.

Layer	PoC	Production	Reason
STT	Groq Whisper v3 Turbo	Deepgram Nova-2	Whisper is cheap and fast for the PoC; Deepgram has better WER on MX-Spanish for production.
LLM	Gemini 2.5 Flash	Claude Haiku 4.5	Aligns prod with the dev environment (Anthropic), native tool calling.
TTS	ElevenLabs Turbo v2.5	ElevenLabs Flash v2.5	Flash is ~50% cheaper at equivalent quality.
Transport	Browser WebSocket	RingCentral PCM16 8kHz	Production rides the call center's existing telephony.
API target	In-memory mock	Real client API	PoC has no external dependency. Waiting on the client sandbox.
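
The "config diff rather than a rewrite" claim can be made concrete by treating each row of the table as a field on a config object. This is a sketch: PipelineConfig and config_diff are hypothetical names, and a real config would hold model IDs and endpoints rather than display labels.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineConfig:
    stt: str
    llm: str
    tts: str
    transport: str
    api_target: str


# Values copied from the PoC and Production columns of the table.
POC = PipelineConfig(
    stt="Groq Whisper v3 Turbo",
    llm="Gemini 2.5 Flash",
    tts="ElevenLabs Turbo v2.5",
    transport="Browser WebSocket",
    api_target="In-memory mock",
)
PROD = PipelineConfig(
    stt="Deepgram Nova-2",
    llm="Claude Haiku 4.5",
    tts="ElevenLabs Flash v2.5",
    transport="RingCentral PCM16 8kHz",
    api_target="Real client API",
)


def config_diff(a: PipelineConfig, b: PipelineConfig) -> dict[str, tuple[str, str]]:
    """Return every field whose value differs, as {field: (a_value, b_value)}."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


migration = config_diff(POC, PROD)
```

All five fields differ between the two profiles, while the code that consumes the config stays the same.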

Operating Costs

Unit economics at 200K calls / month

Estimate for 200,000 calls / month (~250,000 minutes, about 1.25 minutes per call) at list prices, with no enterprise discounts. With negotiated discounts from Deepgram and ElevenLabs, the monthly total would drop to roughly $3,300-3,500.

Component	Monthly
Deepgram Nova-2 (STT)	~$1,475
RingCentral (incremental)	~$950
Claude Haiku 4.5 (LLM)	~$700
AWS ECS infrastructure	~$550
ElevenLabs Flash v2.5 (TTS)	~$250
Total	~$3,925
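
The per-call figure follows directly from the table. The short calculation below reproduces it from the rounded monthly estimates. The variable names are illustrative.

```python
# Monthly estimates in USD, copied from the table above.
MONTHLY_USD = {
    "Deepgram Nova-2 (STT)": 1475,
    "RingCentral (incremental)": 950,
    "Claude Haiku 4.5 (LLM)": 700,
    "AWS ECS infrastructure": 550,
    "ElevenLabs Flash v2.5 (TTS)": 250,
}
CALLS_PER_MONTH = 200_000
MINUTES_PER_MONTH = 250_000

total = sum(MONTHLY_USD.values())             # 3925
cost_per_call = total / CALLS_PER_MONTH       # 0.019625, i.e. ~$0.020
cost_per_minute = total / MINUTES_PER_MONTH   # 0.0157
stt_share = MONTHLY_USD["Deepgram Nova-2 (STT)"] / total  # largest single line item
```

STT is the largest line item at roughly 38% of the total, which is why a Deepgram discount moves the monthly figure more than any other single negotiation.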
