Project · Technical Deep Dive

Clawdia & openclaw

A self-hosted personal AI infrastructure running on a single Hetzner VPS for ~€15-25/month. A multi-agent gateway (openclaw), a pgvector-backed personal knowledge brain (gbrain), per-project workspace isolation, scheduled agent turns, and inbound voice calls handled by an LLM over SIP via Kamailio + rtpengine + ElevenLabs. Built over ~8 months of iteration.

Why build this

The goal isn't novelty — it's to make a personal AI that survives restarts, lives somewhere I control, and doesn't ship every conversation to a third party's logs. Cloud-vendor agents won't give me that. So Clawdia runs on a box I own, with conversations persisted into a knowledge brain I can query, and tools scoped per workspace so a casual chat can't ever touch production code.

Everything on one box

┌─────────────────────────────────────────────────────────────┐
│                       Single VPS                            │
│                                                             │
│  ┌─────────────┐   ┌──────────────┐   ┌──────────────────┐  │
│  │   gateway   │──▶│  agent main  │──▶│  MCP subprocess  │  │
│  │  (systemd)  │   │  (LLM loop)  │   │   (gbrain etc.)  │  │
│  │  port 18789 │   └──────────────┘   └──────────────────┘  │
│  └──────┬──────┘                                            │
│         │                                                   │
│         ├──▶ WhatsApp channel  (puppeteer)                  │
│         ├──▶ browser channel   (puppeteer)                  │
│         ├──▶ active-memory plugin                           │
│         ├──▶ cron scheduler   (agentTurn jobs)              │
│         └──▶ TTS provider     (ElevenLabs)                  │
│                                                             │
│  ┌──────────────┐   ┌─────────────────┐                     │
│  │  Kamailio    │──▶│   rtpengine     │ ◀──── inbound SIP   │
│  │  (SIP proxy) │   │  (media proxy)  │       calls         │
│  └──────────────┘   └─────────────────┘                     │
│                                                             │
│  ┌─────────────────┐                                        │
│  │   PGLite + WAL  │  (gbrain storage, single-process)      │
│  └─────────────────┘                                        │
└─────────────────────────────────────────────────────────────┘
         ▲                          ▲
         │                          │
   Tailscale tunnel           Public internet
   (admin + SIP)              (HTTPS, SIP out)

Two network paths: Tailscale carries all personal ingress (SSH, SIP, MCP admin), while the public internet is used only for outbound LLM/API calls and the SIP leg to ElevenLabs. UFW closes everything else.

The pieces that matter

openclaw gateway

Node process running as systemd --user. Hosts an HTTP API on loopback, owns the agent lifecycle, loads trusted plugins, and spawns sandboxed MCP subprocesses lazily over stdio. Plugins can crash the gateway; MCPs can't — that line was drawn on purpose.

Per-workspace isolation

Each agent lives in ~/.openclaw/workspace-* with its own IDENTITY.md, SOUL.md, USER.md, AGENTS.md, memory, skills, and docs. ~10 workspaces today: one personal, one financial, one per major GitHub repo. Tool surface scoped per workspace.

Model rotation + fallbacks

Each agent config declares a primary model and ordered fallbacks. Default is z.ai's GLM series (cheap, frontier-equivalent); code agents use stronger models. Auto-fallback on rate-limit or 5xx.
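A minimal sketch of the fallback rule described above, assuming a hypothetical call_model(model, prompt) function and a hypothetical ModelError that carries an HTTP-style status. Neither name comes from openclaw. It only illustrates "try the primary, then each fallback in order, but only on rate-limit or 5xx".

```python
from typing import Callable, Sequence


class ModelError(Exception):
    """Hypothetical provider error carrying an HTTP-style status code."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(message or f"provider returned {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    # Fall through only on rate limits (429) and server-side failures (5xx).
    return status == 429 or 500 <= status <= 599


def complete_with_fallbacks(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model in declared order; return (model_used, reply)."""
    last_error = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except ModelError as err:
            if not is_retryable(err.status):
                raise  # e.g. 400/401: switching models will not fix the request
            last_error = err
    raise RuntimeError("all models failed") from last_error
```

Non-retryable errors are re-raised immediately so a bad request or a revoked key does not quietly burn through every fallback.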

Bounded sub-agents

Every agent can spawn parallel sub-agents (maxConcurrent: 2), useful for fan-out work like "summarise these 5 articles". Bounds prevent runaway trees from torching the LLM budget.
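One way to picture the concurrency bound is a semaphore in front of each sub-agent. This is a sketch, not openclaw's scheduler: run_sub_agent is a hypothetical coroutine standing in for a real sub-agent turn, and only the limit of 2 comes from the config above.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

MAX_CONCURRENT = 2  # mirrors maxConcurrent: 2 from the agent config


async def fan_out(
    tasks: Sequence[str],
    run_sub_agent: Callable[[str], Awaitable[str]],
    max_concurrent: int = MAX_CONCURRENT,
) -> list[str]:
    """Run one sub-agent per task, never more than max_concurrent at once."""
    gate = asyncio.Semaphore(max_concurrent)

    async def bounded(task: str) -> str:
        async with gate:  # waits while max_concurrent sub-agents are running
            return await run_sub_agent(task)

    # gather returns results in the same order as the input tasks
    return await asyncio.gather(*(bounded(t) for t in tasks))
```

With five articles to summarise, two sub-agents run at a time and the remaining three queue behind the semaphore.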

Skills system

Markdown files with YAML frontmatter declaring triggers, mutating flags, and write targets. Local skills live in the workspace; bundled skills install from a clawhub registry with versioned _meta.json. Routing is substring-match — picking trigger phrases is the actual craft.
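Substring routing can be sketched in a few lines. The Skill shape below is hypothetical (the real frontmatter schema isn't shown here), and breaking ties by longest trigger is an assumption added for illustration, not a documented openclaw rule.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Skill:
    name: str
    triggers: tuple[str, ...]  # phrases declared in the skill's frontmatter
    mutating: bool = False


def route(message: str, skills: Sequence[Skill]) -> Optional[Skill]:
    """Return the skill whose longest trigger phrase appears in the message."""
    text = message.lower()
    best: Optional[Skill] = None
    best_len = 0
    for skill in skills:
        for trigger in skill.triggers:
            t = trigger.lower()
            # Plain substring test; longer (more specific) triggers win ties.
            if t in text and len(t) > best_len:
                best, best_len = skill, len(t)
    return best
```

The sketch also shows why trigger choice matters: a short trigger such as "remind" matches almost any message containing that word, so specific phrases are the only thing keeping skills from shadowing one another.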

Per-repo dev agents

One agent per major repo (gh-frontend, gh-backend, gh-infra, etc.) — each with its own tool allowlist. The docs agent literally cannot rm -rf a terraform file. Blast radius by design.

gbrain — memory that survives the session

Memory doesn't live in the LLM context. It lives in gbrain, a Postgres-shaped store backed by PGLite (an embedded WASM build of Postgres) exposed to the agents as an MCP server. The agent issues tool calls (query, get_page, search, find_experts); the gateway proxies JSON-RPC over stdio.
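For orientation, this is roughly what one such tool call looks like on the wire. MCP's stdio transport carries JSON-RPC 2.0 messages as newline-delimited JSON, and tool invocations use the tools/call method. The "search" arguments in the usage note are assumptions, since gbrain's actual tool schemas aren't listed here.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids, unique per connection


def tool_call_request(tool: str, arguments: dict) -> str:
    """Encode one MCP tools/call request as a single newline-terminated line."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps emits no raw newlines, so one message is exactly one line.
    return json.dumps(message) + "\n"


def parse_response(line: str) -> dict:
    """Return the result payload, or raise if the server reported an error."""
    message = json.loads(line)
    if "error" in message:
        raise RuntimeError(f"tool call failed: {message['error'].get('message')}")
    return message["result"]
```

A gateway would write tool_call_request("search", {"query": "team size"}) to the subprocess's stdin and pass the next line read from stdout to parse_response.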

What's stored

Pages (markdown, FTS + 1536-dim OpenAI embeddings), chunks (tree-sitter for code, recursive semantic chunker for prose), a typed graph of [[wikilinks]] edges, timeline entries, dated facts with valid_until, and "active takes" extracted from chat.

Hybrid search (RRF)

Retrieval is reciprocal rank fusion over Postgres FTS + pgvector HNSW, with configurable source boosts (from 1.5× for curated pages down to 0.8× for bulk imports) and hard-exclude prefixes filtered pre-rank.
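A compact sketch of that fusion step, under stated assumptions: the two input rankings stand in for the FTS and vector result lists, boosts are keyed by page-id prefix, and k = 60 is the conventional RRF constant rather than a value confirmed for gbrain.

```python
from collections import defaultdict
from typing import Mapping, Sequence

RRF_K = 60  # conventional smoothing constant; gbrain's actual value is not stated


def boost_for(page_id: str, boosts: Mapping[str, float]) -> float:
    # Longest matching prefix wins; pages with no matching prefix are neutral.
    matches = [p for p in boosts if page_id.startswith(p)]
    return boosts[max(matches, key=len)] if matches else 1.0


def rrf_fuse(
    rankings: Sequence[Sequence[str]],
    boosts: Mapping[str, float],
    exclude_prefixes: Sequence[str] = (),
    k: int = RRF_K,
) -> list[str]:
    """Fuse ranked id lists with score = boost * sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Hard excludes are dropped before ranks are assigned (pre-rank).
        kept = [p for p in ranking if not p.startswith(tuple(exclude_prefixes))]
        for rank, page_id in enumerate(kept, start=1):
            scores[page_id] += 1.0 / (k + rank)
    boosted = {p: s * boost_for(p, boosts) for p, s in scores.items()}
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(boosted, key=lambda p: (-boosted[p], p))
```

Because excluded pages are removed before ranks are assigned, they never push a surviving page further down either list.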

The dream cycle

Nightly cron: a cheap LLM (Haiku-class) decides which new transcripts are worth processing, then fans out Sonnet-class sub-agents with scoped write access. Sub-agents distill chronological chat into structured, queryable pages. ~$0.05-0.20/night in tokens.

Temporal facts

Extracted claims carry since_date and valid_until, so the brain answers temporal queries ("what was the team size in March?") instead of just returning current state.
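The temporal lookup reduces to an interval check. In this sketch the Fact shape is hypothetical apart from since_date and valid_until, and treating valid_until as an exclusive bound is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class Fact:
    subject: str
    value: str
    since_date: date
    valid_until: Optional[date] = None  # None means the fact is still current


def facts_as_of(facts: Sequence[Fact], subject: str, when: date) -> list[Fact]:
    """Return the facts about subject that were valid on the given day."""
    return [
        f
        for f in facts
        if f.subject == subject
        and f.since_date <= when
        # valid_until is treated as exclusive: the fact stops holding that day.
        and (f.valid_until is None or when < f.valid_until)
    ]
```

A query for "team size in March" becomes facts_as_of(facts, "team_size", date(2024, 3, 15)) (an illustrative date), which returns the fact whose interval covers that day rather than the latest one.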

cron + agentTurn

~12 scheduled tasks on this box. Each is either a direct exec (deterministic shell invocation; no LLM freedom — daily security scan, gbrain maintenance) or an agentTurn (a prompt the agent decides how to execute — morning briefing, daily summary).

Per-turn knobs: toolsAllow restricts the tool set so a "send a message" cron can't accidentally browse the web for 20 minutes; timeoutSeconds is a hard cap; failureAlert pages via WhatsApp after N consecutive failures.

The wedge fix: if a subprocess (e.g. gbrain dream) goes into a busy loop, killing the agent session leaves orphan children holding a DB lock. Two-layer fix: (1) the runner uses POSIX process groups (setpgid) so timeouts kill the whole tree; (2) lock holders are evicted on next-acquire only when the PID is provably dead on the same host. Cross-host PID probes evict live holders during network blips — don't.
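Both layers can be sketched with the standard library (POSIX only). Two assumptions to flag: the sketch uses start_new_session=True, which calls setsid() rather than setpgid() but still puts the child in a fresh process group, and the lock is assumed to record the holder's hostname alongside its PID.

```python
import os
import signal
import socket
import subprocess
from typing import Sequence


def run_with_group_timeout(cmd: Sequence[str], timeout_seconds: float) -> int:
    """Run cmd in its own process group; on timeout, kill the whole group."""
    # The child becomes a session and process-group leader, so any
    # grandchildren it spawns inherit its process-group id.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # the group id equals the leader's pid
        proc.wait()  # reap the leader so it does not linger as a zombie
        raise


def holder_is_provably_dead(lock_host: str, lock_pid: int) -> bool:
    """True only when the lock was taken on this host and the PID is gone."""
    if lock_host != socket.gethostname():
        return False  # nothing can be proven about another machine's PIDs
    try:
        os.kill(lock_pid, 0)  # signal 0 checks existence without signalling
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # the process exists but belongs to another user
    return False
```

PID reuse can still make a dead holder look alive, which errs on the safe side: the lock is kept a little longer instead of being taken from a live process.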

Talk to Clawdia like a phone call

Pick up a phone, dial Clawdia, talk to her like a normal call. No PSTN number rental, no per-minute carrier fees, snappy enough to feel like a real conversation.

[Phone with Linphone SIP client]
        │  SIP REGISTER + INVITE over TCP, via Tailscale (WireGuard NAT-punched)
        ▼
[Kamailio SIP proxy on VPS]
        │  - digest auth gate
        │  - REGISTER → in-memory location table
        │  - INVITE → SDP rewrite via rtpengine_manage()
        │  - X-Caller-ID header injection
        ▼
[rtpengine media proxy]    dual interface: tail/<tail_ip> ; pub/<public_ip>
        │                  rewrites SDP c= lines per side, relays RTP bidirectionally
        ▼
[ElevenLabs Conversational AI]
        │  ASR  +  Gemini Flash 2.5 LLM  +  ElevenLabs TTS
        │  SIP trunk with ACL whitelist of our public IP only
        ▼
(next: tools bridge → openclaw → gbrain)

Why this stack

It took four iterations to land here: ElevenLabs web widget (browser tab, awful), Twilio Voice (~$1/mo + per-min, routes audio through their POPs), WebRTC from a custom mobile app (too much yak shaving), and finally SIP direct from Linphone via a proxy I control. Free, ergonomic, and reliable.

The latency story

First end-to-end call worked at ~1.3s perceived turn latency. Bad enough to feel like a walkie-talkie. The budget:

Hop                                                          One-way
Phone (CDMX) → server (Hetzner Nürnberg), Tailscale direct   ~140-180 ms
Server (Germany) → ElevenLabs (GCP Iowa)                     ~114 ms
ElevenLabs pipeline (VAD + ASR + LLM + TTS)                  ~600-1000 ms
Perceived turn latency                                       ~1.1-1.5 s
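The budget adds up if each network hop is counted twice (the user's audio travels up, the reply travels back) and the pipeline once. The arithmetic below is a sanity check on the one-way figures in the table, not a measurement.

```python
# One-way figures from the table above, in milliseconds (low, high).
PHONE_TO_SERVER = (140, 180)
SERVER_TO_ELEVENLABS = (114, 114)
PIPELINE = (600, 1000)


def turn_latency_ms(phone, server, pipeline):
    """Audio goes up two hops, is processed, then the reply comes back down."""
    low = 2 * (phone[0] + server[0]) + pipeline[0]
    high = 2 * (phone[1] + server[1]) + pipeline[1]
    return low, high


print(turn_latency_ms(PHONE_TO_SERVER, SERVER_TO_ELEVENLABS, PIPELINE))
# prints (1108, 1588)
```

That gives roughly 1.1 to 1.6 s, close to the perceived range, with about 500-600 ms of it being pure network round trip.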

Config-level wins (no infra changes): turn_eagerness: eager + speculative_turn: true (start generating during the user's silence, discard if they keep talking) + a shorter turn_timeout. Knocked off ~300-400 ms. The remaining ~700 ms is mostly the transatlantic detour — moving the SIP proxy to a US or MX VPS shaves another ~100-150 ms.

Technology

Layer             Technologies
Host              Hetzner VPS · Ubuntu 24.04 · 4 GB RAM · Tailscale + UFW
Gateway           Node.js · systemd --user · loopback HTTP on :18789 · stdio MCP
Agent loop        z.ai GLM-5 / GLM-5.1 (default) · Claude Opus / Sonnet / Haiku · model fallbacks
Knowledge brain   PGLite (WASM Postgres) · pgvector HNSW · tsvector FTS · OpenAI text-embedding-3-small
Retrieval         RRF fusion · source boosts · tree-sitter code chunking
Channels          WhatsApp (Web JS) · puppeteer browser tool · ElevenLabs TTS
SIP signaling     Kamailio 5.7 · digest auth · NAT handling · X-Caller-ID injection
SIP media         rtpengine · userspace forwarding · dual-interface SDP rewrite
Voice agent       ElevenLabs Conversational AI · Gemini Flash 2.5 · SIP trunk ACL
Scheduling        cron + agentTurn · process-group timeouts · same-host PID lock liveness
Monthly cost      ~€15-25 (VPS + LLM tokens + TTS + embeddings)

Lessons that aren't in any docs

  • PID-liveness checks must be same-host only. A "stale" lock from another machine's process is not provably dead from yours. Cross-host PID probes will eventually evict a live holder during a network blip.
  • Kamailio tcp_connection_lifetime defaults to ~120s. Linphone keeps TCP open longer; bump to 3600+ or your clients re-REGISTER every two minutes.
  • systemd rate-limits restarts (5 in 10s by default). Iterating on a Kamailio config that keeps crashing eventually hits "Start request repeated too quickly", and the error doesn't tell you the rate limit is why. Clear it with systemctl reset-failed.
  • ElevenLabs SIP trunk imports default to 0.0.0.0/0 ACL. Anyone who finds the URI can dial in and burn credits. PATCH inbound_trunk_config.allowed_addresses to lock to your proxy's IP.
  • ElevenLabs API keys are per-area scoped. TTS and Agents are separate scopes — your TTS key won't auth the Agents API.
  • Em-dashes in Kamailio config comments break the parser. Stay ASCII. (Yes, really.)
  • record_route() is not allowed in onreply_route in Kamailio 5.7; the rr module only permits it in request, branch, and failure routes. Most copy-pasted online configs that put it in onreply_route are wrong.
  • PGLite + long-running MCP holds the write lock. Shell CLI invocations block. Either run CLI in-process or stop the gateway during heavy CLI work.

Unintuitive lessons

  • Pick one host. Run everything there. Distributed multi-machine setups for a personal AI are over-engineering and a debugging nightmare.
  • Per-project workspaces beat one mega-agent. Tool surface scoping is worth the setup cost.
  • Don't trust auto-cleanup. Crash paths leave debris. Add explicit liveness checks for any lock or shared resource.
  • Pick infrastructure pieces with good logs over fancy abstractions. Kamailio and rtpengine are 20-year-old C codebases with weird config syntax — but their failure modes are documented. Worth more than a slick SDK with cryptic errors.