Clawdia & openclaw
A self-hosted personal AI infrastructure running on a single Hetzner VPS for ~€15-25/month. A multi-agent gateway (openclaw), a pgvector-backed personal knowledge brain (gbrain), per-project workspace isolation, scheduled agent turns, and inbound voice calls handled by an LLM over SIP via Kamailio + rtpengine + ElevenLabs. Built over ~8 months of iteration.
The Goal
Why build this
The goal isn't novelty — it's to make a personal AI that survives restarts, lives somewhere I control, and doesn't ship every conversation to a third party's logs. Cloud-vendor agents won't give me that. So Clawdia runs on a box I own, with conversations persisted into a knowledge brain I can query, and tools scoped per workspace so a casual chat can't ever touch production code.
Topology
Everything on one box
┌─────────────────────────────────────────────────────────────┐
│ Single VPS │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ gateway │──▶│ agent main │──▶│ MCP subprocess │ │
│ │ (systemd) │ │ (LLM loop) │ │ (gbrain etc.) │ │
│ │ port 18789 │ └──────────────┘ └──────────────────┘ │
│ └──────┬──────┘ │
│ │ │
│ ├──▶ WhatsApp channel (puppeteer) │
│ ├──▶ browser channel (puppeteer) │
│ ├──▶ active-memory plugin │
│ ├──▶ cron scheduler (agentTurn jobs) │
│ └──▶ TTS provider (ElevenLabs) │
│ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ Kamailio │──▶│ rtpengine │ ◀──── inbound SIP │
│ │ (SIP proxy) │ │ (media proxy) │ calls │
│ └──────────────┘ └─────────────────┘ │
│ │
│ ┌─────────────────┐ │
│ │ PGLite + WAL │ (gbrain storage, single-process) │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
▲ ▲
│ │
Tailscale tunnel Public internet
(admin + SIP) (HTTPS, SIP out)
Two network paths: Tailscale carries all inbound personal access (SSH, SIP, MCP admin), and the public internet is used only for outbound LLM/API calls and the SIP leg to ElevenLabs. UFW closes everything else.
Architecture
The pieces that matter
openclaw gateway
Node process running as systemd --user. Hosts an HTTP API on
loopback, owns the agent lifecycle, loads trusted plugins, and spawns
sandboxed MCP subprocesses lazily over stdio. Plugins can crash the gateway;
MCPs can't — that line was drawn on purpose.
Per-workspace isolation
Each agent lives in ~/.openclaw/workspace-* with its own
IDENTITY.md, SOUL.md, USER.md,
AGENTS.md, memory, skills, and docs. ~10 workspaces today: one
personal, one financial, one per major GitHub repo. Tool surface scoped per
workspace.
Model rotation + fallbacks
Each agent config declares a primary model and ordered fallbacks. Default is z.ai's GLM series (cheap, frontier-equivalent); code agents use stronger models. Auto-fallback on rate-limit or 5xx.
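A minimal sketch of the fallback rule, assuming the provider client raises an error carrying the HTTP status. `ModelError`, `call_model`, and `complete_with_fallbacks` are hypothetical names for illustration, not openclaw's actual API.

```python
from typing import Callable, Optional, Sequence, Tuple


class ModelError(Exception):
    """Hypothetical error type carrying the provider's HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    # Fall back only on rate limits (429) and server errors (5xx).
    return status == 429 or 500 <= status <= 599


def complete_with_fallbacks(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try the primary model first, then each fallback in declared order.

    Returns (model_used, reply). Non-retryable errors propagate immediately.
    """
    last_error: Optional[ModelError] = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except ModelError as err:
            if not is_retryable(err.status):
                raise
            last_error = err
    raise RuntimeError("all models failed") from last_error
```

A 4xx other than 429 (bad request, bad key) is re-raised rather than retried, since a different model is unlikely to fix a malformed call.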
Bounded sub-agents
Every agent can spawn parallel sub-agents (maxConcurrent: 2),
useful for fan-out work like "summarise these 5 articles". Bounds prevent
runaway trees from torching the LLM budget.
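The bound can be sketched with a semaphore. This assumes sub-agents are awaitable calls; `fan_out` and `run_sub_agent` are hypothetical names, and only the limit of 2 comes from the text.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

MAX_CONCURRENT = 2  # mirrors the maxConcurrent: 2 bound described above


async def fan_out(
    tasks: Sequence[str],
    run_sub_agent: Callable[[str], Awaitable[str]],
    max_concurrent: int = MAX_CONCURRENT,
) -> List[str]:
    """Run one sub-agent per task, never more than max_concurrent at once."""
    gate = asyncio.Semaphore(max_concurrent)

    async def bounded(task: str) -> str:
        async with gate:
            return await run_sub_agent(task)

    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*(bounded(t) for t in tasks))
```

With five articles to summarise, two run immediately and the other three wait for a free slot, so spend grows linearly with the task list instead of with the width of the tree.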
Skills system
Markdown files with YAML frontmatter declaring triggers, mutating flags, and
write targets. Local skills live in the workspace; bundled skills install
from a clawhub registry with versioned _meta.json. Routing is
substring-match — picking trigger phrases is the actual craft.
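To see why trigger phrases are the craft, here is a rough sketch of substring routing. The `Skill` shape, case-insensitive matching, and first-match-wins ordering are all assumptions; the source only says routing is substring-match.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Skill:
    name: str
    triggers: Sequence[str]  # phrases declared in the skill's YAML frontmatter
    mutating: bool = False


def route(message: str, skills: Sequence[Skill]) -> Optional[Skill]:
    """Return the first skill with a trigger phrase contained in the message."""
    text = message.lower()
    for skill in skills:
        if any(trigger.lower() in text for trigger in skill.triggers):
            return skill
    return None


# A too-short trigger hijacks unrelated messages: "log" is a substring of "blog".
sloppy = [Skill("expense-log", ["log"], mutating=True), Skill("blog-draft", ["blog post"])]
careful = [Skill("expense-log", ["log expense"], mutating=True), Skill("blog-draft", ["blog post"])]
```

Under these assumptions, "Draft a blog post about SIP" routes to expense-log with the sloppy triggers and to blog-draft with the careful ones.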
Per-repo dev agents
One agent per major repo (gh-frontend, gh-backend,
gh-infra, etc.) — each with its own tool allowlist. The docs
agent literally cannot rm -rf a terraform file. Blast radius by
design.
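The allowlist check itself can be very small. In this sketch, gh-infra is an agent name from the text, while gh-docs and every tool name are made up for illustration; the real config format isn't shown in the source.

```python
from typing import Dict, FrozenSet

# Hypothetical allowlists: tool names are illustrative, not openclaw's.
ALLOWLISTS: Dict[str, FrozenSet[str]] = {
    "gh-docs": frozenset({"read_file", "write_markdown"}),
    "gh-infra": frozenset({"read_file", "shell", "terraform_plan"}),
}


class ToolNotAllowed(PermissionError):
    """Raised when an agent requests a tool outside its allowlist."""


def authorize(agent: str, tool: str) -> None:
    # Unknown agents get an empty allowlist, so the default is deny.
    allowed = ALLOWLISTS.get(agent, frozenset())
    if tool not in allowed:
        raise ToolNotAllowed(f"{agent} may not call {tool}")
```

Because the check runs before a tool call is dispatched, a docs agent without a shell tool never gets the chance to touch infrastructure files, whatever the prompt says.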
Knowledge Brain
gbrain — memory that survives the session
Memory doesn't live in the LLM context. It lives in gbrain,
a Postgres-shaped store backed by PGLite (an embedded WASM build of Postgres)
exposed to the agents as an MCP server. The agent issues tool calls
(query, get_page, search,
find_experts); the gateway proxies JSON-RPC over stdio.
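On the wire, an MCP tool call is a JSON-RPC 2.0 message written to the subprocess's stdin, one JSON object per line. The sketch below shows the general shape; the helper names are hypothetical and the `query` argument key is an assumption about gbrain's tool schema.

```python
import json
from itertools import count
from typing import Any, Dict

_ids = count(1)  # JSON-RPC request ids only need to be unique per connection


def tool_call_request(tool: str, arguments: Dict[str, Any]) -> str:
    """Serialise one MCP tools/call request as a newline-delimited message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message) + "\n"


def parse_response(line: str) -> Any:
    """Return the result payload, or raise if the server reported an error."""
    reply = json.loads(line)
    if "error" in reply:
        raise RuntimeError(reply["error"].get("message", "unknown error"))
    return reply["result"]


# Example: the line the gateway would write to gbrain's stdin.
request_line = tool_call_request("search", {"query": "team size in March"})
```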
What's stored
Pages (markdown, FTS + 1536-dim OpenAI embeddings), chunks (tree-sitter for
code, recursive semantic chunker for prose), a typed graph of
[[wikilinks]] edges, timeline entries, dated facts with
valid_until, and "active takes" extracted from chat.
Hybrid search (RRF)
Retrieval is reciprocal rank fusion over Postgres FTS + pgvector HNSW, with configurable source boosts (multipliers from 1.5× for curated pages down to 0.8× for bulk imports, so curated pages outrank imports) and hard-exclude prefixes filtered pre-rank.
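A compact sketch of the fusion step, assuming each retriever returns an ordered list of page ids. The constant k = 60 is the conventional RRF default rather than a value taken from gbrain, and applying the boost as a multiplier on the fused score is likewise an assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

RRF_K = 60  # conventional RRF constant; gbrain's actual value is not stated


def rrf_fuse(
    rankings: Sequence[Sequence[str]],
    boost: Callable[[str], float] = lambda page_id: 1.0,
    excluded_prefixes: Sequence[str] = (),
    k: int = RRF_K,
) -> List[str]:
    """Fuse ranked lists: score(page) = boost(page) * sum(1 / (k + rank))."""
    prefixes = tuple(excluded_prefixes)
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Hard excludes are dropped before ranks are assigned (pre-rank filter).
        kept = [p for p in ranking if not p.startswith(prefixes)]
        for rank, page_id in enumerate(kept, start=1):
            scores[page_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda p: scores[p] * boost(p), reverse=True)


def source_boost(page_id: str) -> float:
    # Hypothetical id prefixes standing in for "curated" and "bulk import".
    if page_id.startswith("curated/"):
        return 1.5
    if page_id.startswith("import/"):
        return 0.8
    return 1.0
```

Filtering before ranks are assigned matters: an excluded page that sat at rank 1 would otherwise push every legitimate result one slot down.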
The dream cycle
Nightly cron: a cheap LLM (Haiku-class) decides which new transcripts are worth processing, then fans out Sonnet-class sub-agents with scoped write access. Sub-agents distill chronological chat into structured, queryable pages. ~$0.05-0.20/night in tokens.
Temporal facts
Extracted claims carry since_date and valid_until,
so the brain answers temporal queries ("what was the team size in March?")
instead of just returning current state.
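As an illustration, a point-in-time lookup over such facts might look like the sketch below. The `Fact` shape is hypothetical beyond the two date fields named above, and treating valid_until as exclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Fact:
    subject: str
    value: str
    since_date: date
    valid_until: Optional[date] = None  # None means "still true"


def facts_as_of(facts: Sequence[Fact], subject: str, when: date) -> List[Fact]:
    """Return the facts about subject that held on the given day."""
    return [
        f
        for f in facts
        if f.subject == subject
        and f.since_date <= when
        # Assumption: valid_until is exclusive, so adjacent facts never overlap.
        and (f.valid_until is None or when < f.valid_until)
    ]


history = [
    Fact("team size", "4", date(2024, 1, 1), date(2024, 4, 1)),
    Fact("team size", "6", date(2024, 4, 1)),
]
```

Querying `history` for 15 March 2024 returns the "4" fact, while a query for today returns "6".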
Scheduled Work
cron + agentTurn
~12 scheduled tasks on this box. Each is either a direct exec (deterministic shell invocation; no LLM freedom — daily security scan, gbrain maintenance) or an agentTurn (a prompt the agent decides how to execute — morning briefing, daily summary).
Per-turn knobs: toolsAllow restricts the tool set so a "send a
message" cron can't accidentally browse the web for 20 minutes;
timeoutSeconds is a hard cap; failureAlert pages
via WhatsApp after N consecutive failures.
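The failureAlert behaviour can be sketched as a consecutive-failure counter. Alerting once, at the moment the threshold is crossed, is an assumption about the semantics; `notify` stands in for whatever sends the WhatsApp message.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailureAlert:
    threshold: int                 # N consecutive failures before paging
    notify: Callable[[str], None]  # hypothetical hook, e.g. a WhatsApp sender
    consecutive: int = 0

    def record(self, job: str, ok: bool) -> None:
        """Track one run; page exactly once when the streak hits the threshold."""
        if ok:
            self.consecutive = 0  # any success resets the streak
            return
        self.consecutive += 1
        if self.consecutive == self.threshold:
            self.notify(f"{job} failed {self.consecutive} times in a row")
```

Comparing with `==` rather than `>=` keeps a job that is down all night from paging on every run.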
The wedge fix: if a subprocess (e.g. gbrain dream)
goes into a busy loop, killing the agent session leaves orphan children holding
a DB lock. Two-layer fix: (1) the runner uses POSIX process groups
(setpgid) so timeouts kill the whole tree; (2) lock holders are
evicted on next-acquire only when the PID is provably dead on the same
host. Cross-host PID probes evict live holders during network blips — don't.
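Both layers fit in a short POSIX-only sketch. It uses `start_new_session=True` (which calls setsid and therefore also creates a new process group) as a stand-in for the setpgid call mentioned above. The lock record fields `holder_host` and `holder_pid` are hypothetical.

```python
import os
import signal
import socket
import subprocess
from typing import Sequence


def run_with_tree_timeout(cmd: Sequence[str], timeout_seconds: float) -> int:
    """Run cmd as a process-group leader; on timeout, kill the whole group."""
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        try:
            # The child leads its own group, so its pgid equals its pid.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited on its own in the meantime
        return proc.wait()


def pid_is_dead(pid: int) -> bool:
    """Signal 0 probes for existence without delivering anything."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # the process exists but belongs to another user
    return False


def may_evict(holder_host: str, holder_pid: int) -> bool:
    """Evict a lock holder only if it is provably dead on this host."""
    if holder_host != socket.gethostname():
        return False  # a remote PID cannot be proven dead from here
    return pid_is_dead(holder_pid)
```

Note that `may_evict` refuses remote holders outright instead of probing them, which is the conservative reading of the rule above: a lock held on another host stays held until that host releases it.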
Voice
Talk to Clawdia like a phone call
Pick up a phone, dial Clawdia, talk to her like a normal call. No PSTN number rental, no per-minute carrier fees, snappy enough to feel like a real conversation.
[Phone with Linphone SIP client]
│ SIP REGISTER + INVITE over TCP, via Tailscale (WireGuard NAT-punched)
▼
[Kamailio SIP proxy on VPS]
│ - digest auth gate
│ - REGISTER → in-memory location table
│ - INVITE → SDP rewrite via rtpengine_manage()
│ - X-Caller-ID header injection
▼
[rtpengine media proxy] dual interface: tail/<tail_ip> ; pub/<public_ip>
│ rewrites SDP c= lines per side, relays RTP bidirectionally
▼
[ElevenLabs Conversational AI]
│ ASR + Gemini Flash 2.5 LLM + ElevenLabs TTS pipeline
│ SIP trunk with ACL whitelist of our public IP only
▼
(next: tools bridge → openclaw → gbrain)
Why this stack
Four iterations before landing here: ElevenLabs web widget (browser tab — awful), Twilio Voice (~$1/mo + per-min, routes audio through their POPs), WebRTC from a custom mobile app (too much yak shaving), and finally SIP direct from Linphone via a proxy I control. Free, ergonomic, and reliable.
The latency story
First end-to-end call worked at ~1.3s perceived turn latency. Bad enough to feel like a walkie-talkie. The budget:
| Hop | Latency (network hops are one-way) |
|---|---|
| Phone (CDMX) → server (Hetzner Nürnberg), Tailscale direct | ~140-180 ms |
| Server (Germany) → ElevenLabs (GCP Iowa) | ~114 ms |
| ElevenLabs pipeline (VAD + ASR + LLM + TTS) | ~600-1000 ms |
| Perceived turn latency | ~1.1-1.5 s |
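One way to reconcile the rows with the perceived figure is to assume each turn crosses both network hops twice (the caller's speech going out, the synthesized reply coming back) plus one pass through the pipeline. That reading is an assumption, since the source doesn't spell out how the total was derived.

```python
from typing import Sequence, Tuple

Range = Tuple[int, int]  # (low, high) in milliseconds


def turn_latency_ms(network_hops: Sequence[Range], pipeline: Range) -> Range:
    """Perceived latency = 2 x (sum of one-way hops) + pipeline time."""
    one_way_low = sum(low for low, _ in network_hops)
    one_way_high = sum(high for _, high in network_hops)
    return (2 * one_way_low + pipeline[0], 2 * one_way_high + pipeline[1])


hops = [(140, 180), (114, 114)]  # phone -> server, server -> ElevenLabs
pipeline = (600, 1000)           # VAD + ASR + LLM + TTS
print(turn_latency_ms(hops, pipeline))  # (1108, 1588)
```

Under that assumption the result is roughly 1.1 to 1.6 s, which lines up with the ~1.1-1.5 s in the table and shows that the pipeline, not the network, is the larger share.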
Config-level wins (no infra changes): turn_eagerness: eager +
speculative_turn: true (start generating during the user's silence,
discard if they keep talking) + a shorter turn_timeout. Knocked off
~300-400 ms. The remaining ~700 ms is mostly the transatlantic detour — moving
the SIP proxy to a US or MX VPS shaves another ~100-150 ms.
Stack
Technology
| Layer | Technologies |
|---|---|
| Host | Hetzner VPS · Ubuntu 24.04 · 4 GB RAM · Tailscale + UFW |
| Gateway | Node.js · systemd --user · loopback HTTP on :18789 · stdio MCP |
| Agent loop | z.ai GLM-5 / GLM-5.1 (default) · Claude Opus / Sonnet / Haiku · model fallbacks |
| Knowledge brain | PGLite (WASM Postgres) · pgvector HNSW · tsvector FTS · OpenAI text-embedding-3-small |
| Retrieval | RRF fusion · source boosts · tree-sitter code chunking |
| Channels | WhatsApp (Web JS) · puppeteer browser tool · ElevenLabs TTS |
| SIP signaling | Kamailio 5.7 · digest auth · NAT handling · X-Caller-ID injection |
| SIP media | rtpengine · userspace forwarding · dual-interface SDP rewrite |
| Voice agent | ElevenLabs Conversational AI · Gemini Flash 2.5 · SIP trunk ACL |
| Scheduling | cron + agentTurn · process-group timeouts · same-host PID lock liveness |
| Monthly cost | ~€15-25 (VPS + LLM tokens + TTS + embeddings) |
Things That Bit Me
Lessons that aren't in any docs
- PID-liveness checks must be same-host only. A "stale" lock from another machine's process is not provably dead from yours. Cross-host PID probes will eventually evict a live holder during a network blip.
- Kamailio tcp_connection_lifetime defaults to ~120s. Linphone keeps TCP open longer; bump to 3600+ or your clients re-REGISTER every two minutes.
- systemd rate-limits restarts (5 in 10s). Iterating on a Kamailio config that keeps crashing eventually hits "Start request repeated too quickly", and the error doesn't tell you that's why. Fix: systemctl reset-failed.
- ElevenLabs SIP trunk imports default to a 0.0.0.0/0 ACL. Anyone who finds the URI can dial in and burn credits. PATCH inbound_trunk_config.allowed_addresses to lock to your proxy's IP.
- ElevenLabs API keys are per-area scoped. TTS and Agents are separate scopes; your TTS key won't auth the Agents API.
- Em-dashes in Kamailio config comments break the parser. Stay ASCII. (Yes, really.)
- record_route() is request_route-only in Kamailio 5.7. Most copy-pasted online configs that put it in onreply_route are wrong.
- PGLite + long-running MCP holds the write lock. Shell CLI invocations block. Either run CLI in-process or stop the gateway during heavy CLI work.
Takeaways
Unintuitive lessons
- Pick one host. Run everything there. Distributed multi-machine setups for a personal AI are over-engineering and a debugging nightmare.
- Per-project workspaces beat one mega-agent. Tool surface scoping is worth the setup cost.
- Don't trust auto-cleanup. Crash paths leave debris. Add explicit liveness checks for any lock or shared resource.
- Pick infrastructure pieces with good logs over fancy abstractions. Kamailio and rtpengine are 20-year-old C codebases with weird config syntax — but their failure modes are documented. Worth more than a slick SDK with cryptic errors.