Hermes message routing — how AGNT routes between agents, channels, and models
Hermes is the message router at the center of AGNT's agent network — it decides which agent handles a message, which model processes it, and which channel delivers the response.
Hermes is the routing layer between incoming messages (WhatsApp, Telegram, API) and the agent fleet. Every user message passes through Hermes before it reaches an LLM. This guide walks through every stage of the routing pipeline — from channel ingestion through soul loading, model selection, tool execution, and response delivery — with references to the actual modules that implement each stage.
Prerequisites
- Familiarity with AGNT's agent architecture.
- Understanding of LLM APIs.
- Basic knowledge of message queuing.
What Hermes does
Hermes is not a single binary — it is the routing pattern that connects AGNT's channel webhooks to its LLM backends and back. When a user sends "find me a sunset bar in Canggu" on WhatsApp, Hermes orchestrates the following sequence:
- The 360Dialog webhook receives the inbound message.
- `msg_router.handle_message` resolves the user via the `user_channels` table by matching platform + channel_id.
- The soul loader builds a system prompt from the user's personality, memory, recent context, and semantically recalled facts.
- The LLM gateway routes to the correct model backend based on the user's tier.
- If the model invokes tools, `tool_executor.run` dispatches to the appropriate handler (venue search, booking, calorie scan, etc.).
- `channel_sender` delivers the response back to the originating platform.
The core implementation lives in `agnt-backend/app/core/msg_router.py`. It is a single async function — `handle_message(event, db)` — that owns the entire pipeline from channel identification to response assembly. Every other module it calls (soul_loader, llm_gateway, tool_executor, session_store, channel_sender) is a stateless service with a narrow interface. This design means Hermes can be tested by mocking any single stage without standing up the full stack.
The routing decision tree
Hermes processes every inbound message through a fixed 10-step pipeline. The steps, in order:
1. Resolve the user from the channel — look up `UserChannel` by platform + sender_id; reject unactivated channels.
2. Check credits — `credit_manager.check_and_deduct` with SELECT FOR UPDATE to prevent double-spend on concurrent messages.
3. Load the soul — build the system prompt from structural memory, cross-session context, and semantic recall (pgvector cosine search against the user's non-structural memory facts).
4. Load session history — platform-scoped, Fernet-encrypted conversation history from Redis.
5. Handle image routing — if the message includes an image, classify intent (dupe_search, calorie_scan, or general) and short-circuit to the appropriate vision pipeline.
6. Build the messages array — combine history + current message, including multimodal content blocks for images.
7. Gate tools — `tool_gate.filter_tools` prunes the 17-tool schema list based on message intent to cut input-token waste (~2,100 tokens saved per trivial message).
8. Call the LLM — `llm_gateway.complete` sends to the model with tier-appropriate backend selection and a 45-second timeout.
9. Execute tool calls — if the model requests tools, dispatch via `tool_executor.run`, then make a second LLM call with the tool results (max 1 tool round per message).
10. Save history and log — append to the session store, log the interaction with full provenance metadata, and trigger `memory_writer.extract_and_save` for automatic memory extraction (fire-and-forget).
Each step has explicit fallbacks. If the user is not found, Hermes returns a signup prompt. If credits are exhausted, it returns an upgrade CTA with the user's referral code. If the LLM call fails, it refunds the credit via `credit_manager.refund_credit` and returns a friendly retry message. If tool execution succeeds but the second LLM call fails, it falls back to raw tool output prefixed with a disclaimer. The pipeline never silently drops a message.
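The fallback ladder can be condensed into a plain function. This is a simplified, synchronous sketch — the real `handle_message` is async and database-backed, and the response strings here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Reply:
    text: str
    refunded: bool = False

def route_message(user, has_credits, call_llm, refund):
    # Unknown channel: return a signup prompt instead of dropping the message.
    if user is None:
        return Reply("Sign up to start chatting!")
    # Exhausted credits: upgrade CTA that includes the user's referral code.
    if not has_credits:
        return Reply(f"You're out of credits. Upgrade, or share code {user['referral_code']}.")
    try:
        return Reply(call_llm())
    except Exception:
        # LLM failure: refund the credit so users are never charged for errors.
        refund(user)
        return Reply("Something went wrong, please try again.", refunded=True)
```

Because every stage is a narrow, injectable dependency, each fallback branch can be exercised in isolation.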
Fleet adapters and model selection
Hermes routes to models through the LLM gateway (`agnt-backend/app/core/llm_gateway.py`). The gateway maintains a `TIER_MAP` that maps user subscription tiers to ordered lists of model backends: free users route to `claude-haiku`, pro users to `claude-sonnet`, starter users get `claude-haiku` with Ollama as fallback, venue-growth agents get `claude-sonnet` with `claude-haiku` as fallback. The model map resolves logical names to actual model IDs — `claude-haiku` maps to `claude-haiku-4-5-20251001`, `claude-sonnet` maps to `claude-sonnet-4-6`.
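The tier and model maps might look like this — a sketch in which the dictionary shapes and the Ollama model name are assumptions, while the tier-to-backend pairings and concrete model IDs come from the description above:

```python
# Ordered fallback chains per subscription tier (first entry is preferred).
TIER_MAP = {
    "free": ["claude-haiku"],
    "starter": ["claude-haiku", "ollama"],
    "pro": ["claude-sonnet"],
    "venue_growth": ["claude-sonnet", "claude-haiku"],
}

# Logical backend names resolved to concrete model IDs.
MODEL_MAP = {
    "claude-haiku": "claude-haiku-4-5-20251001",
    "claude-sonnet": "claude-sonnet-4-6",
    "ollama": "llama3",  # assumed local model name
}

def resolve_backends(tier: str) -> list[str]:
    """Return the ordered list of concrete model IDs for a tier."""
    return [MODEL_MAP[name] for name in TIER_MAP.get(tier, TIER_MAP["free"])]
```

Unknown tiers fall back to the free chain here, which is an assumption about the gateway's default.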
Cost tracking is per-call. The gateway computes USD cost from per-million-token rates ($0.80/$4.00 input/output for Haiku, $3.00/$15.00 for Sonnet) and records it to both per-user daily budgets and a platform-wide global spend counter in Redis. Max output tokens are tier-gated too: 1,024 for Haiku, 2,048 for Sonnet.
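The per-call cost math is simple arithmetic over the rates above (function name hypothetical):

```python
# (input USD per 1M tokens, output USD per 1M tokens)
RATES = {
    "claude-haiku": (0.80, 4.00),
    "claude-sonnet": (3.00, 15.00),
}

def call_cost_usd(backend: str, tokens_in: int, tokens_out: int) -> float:
    """USD cost of one completion at the backend's per-million-token rates."""
    rate_in, rate_out = RATES[backend]
    return tokens_in / 1_000_000 * rate_in + tokens_out / 1_000_000 * rate_out
```

A typical short exchange (1,000 tokens in, 500 out) on Haiku costs $0.0028 — which is why the per-user and global counters track fractional cents via floating-point increments.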
The adapter interface in the fleet context (managed via Paperclip) exposes four methods: `complete`, `stream`, `supports_tools`, and `health_check`. When a fleet adapter swap is issued (e.g., moving an agent from `claude_local` to `codex_local` during a provider outage), the runtime patches the adapter pointer without restarting the agent process. The `preserve_cwd` flag ensures the agent's working directory survives the swap — critical for in-flight tool calls that depend on file state. A failed health check during the swap triggers an automatic rollback to the previous adapter.
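A minimal sketch of the four-method interface and the swap-with-rollback behavior — the `Protocol` shape, the `dict`-based agent record, and the method signatures are assumptions; only the method names and the preserve-cwd/rollback semantics come from the text:

```python
from typing import Any, Iterator, Protocol

class FleetAdapter(Protocol):
    """The four-method adapter interface (signatures assumed)."""
    def complete(self, messages: list[dict], **opts: Any) -> str: ...
    def stream(self, messages: list[dict], **opts: Any) -> Iterator[str]: ...
    def supports_tools(self) -> bool: ...
    def health_check(self) -> bool: ...

def swap_adapter(agent: dict, new_adapter: FleetAdapter, preserve_cwd: bool = True) -> bool:
    """Patch the adapter pointer without restarting the agent process."""
    old_adapter, old_cwd = agent["adapter"], agent.get("cwd")
    agent["adapter"] = new_adapter
    if preserve_cwd:
        agent["cwd"] = old_cwd  # in-flight tool calls keep their file state
    if not new_adapter.health_check():
        agent["adapter"] = old_adapter  # failed health check: automatic rollback
        return False
    return True
```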
Soul loading
Before any LLM call, Hermes loads the agent's "soul" — a structured system prompt assembled from multiple data sources. The soul loader (`agnt-backend/app/core/soul_loader.py`) builds the prompt in layers:
- Base personality — agent name, tone (one of 4 options: chill/friendly/professional/hype), and role instructions.
- User facts — 16 structural memory keys including diet, favorite areas, interests, last booking, typical party size, preferred booking hour/day, favorite venue, and favorite category.
- Fitness context — calorie target, dietary restrictions, fitness goal.
- Cross-session context — the last 5 inbound messages from past sessions, with relative timestamps.
- Deep-dream synthesis — weekly insights plus up to 3 knowledge gaps from the dreaming engine, probed gently across conversations.
The base prompt — everything except semantic recall — is cached in Redis with a 1-hour TTL (`SOUL_TTL = 3600`). The cache key is `soul:{user_id}:base`. Semantic recall is per-message: the soul loader embeds the inbound message via OpenAI, runs a pgvector cosine-distance query (`ORDER BY embedding <=> CAST(:qemb AS vector)`) against the user's `user_memory` rows (excluding structural keys to avoid duplication), and appends the top 10 matching facts as a `RELEVANT CONTEXT` block. Trivial messages (greetings, single emojis, "ok", "thanks") skip the embed + pgvector lookup entirely — a regex filter (`_TRIVIAL_MSG_RE`) catches these and increments `semantic_recall_skipped_trivial` to track cost savings.
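The cache key and the trivial-message gate can be sketched as follows. The actual `_TRIVIAL_MSG_RE` pattern is not shown in this guide, so the regex below is an illustrative stand-in:

```python
import re

SOUL_TTL = 3600  # base prompt cached in Redis for 1 hour

# Illustrative stand-in for the real _TRIVIAL_MSG_RE: greetings,
# acknowledgements, and 1-3 non-word characters (single emojis, "!!", etc.).
_TRIVIAL_MSG_RE = re.compile(
    r"^(hi|hey|hello|ok|okay|thanks|thank you|yo|sup)[.!?]*$|^\W{1,3}$",
    re.IGNORECASE,
)

def soul_cache_key(user_id: int) -> str:
    return f"soul:{user_id}:base"

def should_skip_semantic_recall(text: str) -> bool:
    """Skip the embed + pgvector lookup for trivial messages."""
    return bool(_TRIVIAL_MSG_RE.match(text.strip()))
```

In production the skip path also increments the `semantic_recall_skipped_trivial` counter so the savings are visible in metrics.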
All user-editable fields pass through `_sanitize_memory_value` before prompt interpolation. This function strips prompt injection patterns (16 attack vectors including "ignore previous instructions", "system override", "DAN mode", "jailbreak", "you are now"), collapses multi-line whitespace, and truncates to a safe length (200 chars for values, 50 for keys). Soul cache invalidation retries 3 times with backoff (0.5s, 1.0s) and alerts ops on final failure — a stale soul can serve for up to 1 hour.
Channel multiplexing
Hermes handles WhatsApp (via 360Dialog), Telegram (via Bot API), Instagram (via Meta Graph API), and REST API simultaneously. Each channel has its own webhook handler (`agnt-backend/app/routers/webhook.py`), its own message format normalization, and its own outbound sender in `channel_sender.py`. All inbound messages are normalized to a `MessageEvent` dataclass (platform, sender_id, text, media_url, media_type) before reaching `msg_router.handle_message` — the router never sees platform-specific fields.
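The normalization boundary can be sketched like this — the `MessageEvent` fields come from the text, but the normalizer names and the webhook payload shapes below are simplified assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MessageEvent:
    platform: str
    sender_id: str
    text: str
    media_url: Optional[str] = None
    media_type: Optional[str] = None

def from_telegram(update: dict) -> MessageEvent:
    """Normalize a (simplified) Telegram Bot API update."""
    msg = update["message"]
    return MessageEvent("telegram", str(msg["from"]["id"]), msg.get("text", ""))

def from_whatsapp(payload: dict) -> MessageEvent:
    """Normalize a (simplified) 360Dialog inbound payload."""
    msg = payload["messages"][0]
    return MessageEvent("whatsapp", msg["from"], msg.get("text", {}).get("body", ""))
```

Everything downstream of these normalizers is platform-agnostic, which is what keeps `msg_router.handle_message` free of per-channel branches.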
Outbound delivery uses a shared `httpx.AsyncClient` initialized at application startup with a 10-second timeout. Each platform sender (`_send_telegram`, `_send_whatsapp`, `_send_instagram`) handles its own API format, authentication (Telegram bot token, 360Dialog D360-API-KEY header, Meta Graph access_token param), and error codes. All senders use a common retry helper with exponential backoff + jitter (delays: 0s, 1s, 3s, 9s) and respect 429 Retry-After headers. Permanent 4xx errors are not retried; transient 5xx errors and network timeouts are.
Session history is platform-scoped: a WhatsApp conversation and a Telegram conversation for the same user maintain separate histories. The scope key is `{platform}:channel:{channel_id}`, which prevents cross-channel context leakage while allowing the user to pick up naturally on each platform. History is stored in Redis with Fernet encryption, and the session idle timeout is 12 hours — after that, a new session is created in the database.
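The scope key and idle-timeout check, as a sketch (function names hypothetical):

```python
from datetime import datetime, timedelta

SESSION_IDLE_TIMEOUT = timedelta(hours=12)

def session_scope_key(platform: str, channel_id: str) -> str:
    """Platform-scoped history key — prevents cross-channel context leakage."""
    return f"{platform}:channel:{channel_id}"

def needs_new_session(last_seen: datetime, now: datetime) -> bool:
    """After 12 idle hours, a new session row is created in the database."""
    return now - last_seen > SESSION_IDLE_TIMEOUT
```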
Failover and backpressure
Hermes applies backpressure at two levels. First, the LLM gateway uses an `asyncio.Semaphore(30)` to cap concurrent LLM calls across the entire process. If all 30 slots are occupied, new requests wait up to 15 seconds before raising a capacity error. This prevents a traffic spike from overwhelming the Anthropic API and triggering rate limits.
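The capacity gate is a standard semaphore-with-timeout pattern. A sketch, assuming Python 3.10+ asyncio (the wrapper name and error message are hypothetical):

```python
import asyncio

llm_semaphore = asyncio.Semaphore(30)  # cap concurrent LLM calls process-wide

async def with_capacity(call, wait_timeout: float = 15.0):
    """Wait up to wait_timeout for a free slot, then run the call."""
    try:
        await asyncio.wait_for(llm_semaphore.acquire(), timeout=wait_timeout)
    except asyncio.TimeoutError:
        raise RuntimeError("LLM capacity exhausted, try again shortly")
    try:
        return await call()
    finally:
        llm_semaphore.release()  # always free the slot, even on failure
```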
Second, the global spend cap (`LLM_GLOBAL_DAILY_BUDGET_USD`) acts as a platform-wide circuit breaker. The gateway increments a Redis counter (`agnt:global_spend:{date}`) with each call's USD cost via `INCRBYFLOAT`. When the counter exceeds the cap, all LLM calls are blocked for the remainder of the day. A manual `kill_switch()` function can instantly set this counter to 999,999 in an emergency. The global budget check fails open on Redis unavailability — a Redis outage should not block LLM calls. Per-user daily token budgets (50K free, 200K starter, 1M pro) fail closed — if Redis is down, the request is denied.
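The counter logic can be sketched with an in-memory stand-in for Redis (the class is hypothetical; the key shape, `INCRBYFLOAT` semantics, and 999,999 kill-switch value come from the text):

```python
class SpendCounter:
    """In-memory stand-in for the Redis-backed global spend counter."""
    def __init__(self, daily_budget_usd: float):
        self.budget = daily_budget_usd
        self.counters: dict[str, float] = {}

    def key(self, date: str) -> str:
        return f"agnt:global_spend:{date}"

    def record(self, date: str, cost_usd: float) -> float:
        """Equivalent of INCRBYFLOAT: add this call's cost to today's total."""
        k = self.key(date)
        self.counters[k] = self.counters.get(k, 0.0) + cost_usd
        return self.counters[k]

    def is_blocked(self, date: str) -> bool:
        """Circuit breaker: block all LLM calls once the cap is exceeded."""
        return self.counters.get(self.key(date), 0.0) > self.budget

    def kill_switch(self, date: str) -> None:
        """Emergency stop: force the counter far past any plausible budget."""
        self.counters[self.key(date)] = 999_999.0
```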
On the channel delivery side, each platform has a Redis-backed circuit breaker. After 10 consecutive send failures within a 5-minute sliding window, the circuit opens for 2 minutes, skipping all sends to that platform. Failed sends are pushed to a Redis retry queue (`failed_sends`) with a backoff schedule of 1 minute, 5 minutes, and 25 minutes. After 3 failed retries, messages move to a dead-letter queue (`failed_sends:dead_letter`) retained for 7 days. Failure reasons are classified: `rate_limited`, `network_timeout`, `service_unavailable`, `unauthorized`, `invalid_recipient`. The circuit resets on the first successful send. Queue length is capped at 5,000 entries.
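A simplified breaker, using a consecutive-failure counter where the real version uses a Redis-backed 5-minute sliding window (the class and injected clock are assumptions):

```python
class ChannelCircuitBreaker:
    """Opens after `threshold` failures; resets on the first successful send."""
    def __init__(self, threshold: int = 10, open_seconds: float = 120.0):
        self.threshold = threshold
        self.open_seconds = open_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = now  # trip the breaker

    def record_success(self) -> None:
        # Circuit resets on the first successful send.
        self.failures = 0
        self.opened_at = None

    def is_open(self, now: float) -> bool:
        if self.opened_at is None:
            return False
        if now - self.opened_at >= self.open_seconds:
            # Open window elapsed: allow sends to the platform again.
            self.failures = 0
            self.opened_at = None
            return False
        return True
```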
Model failover is tier-based and automatic. If `claude-sonnet` fails for a venue-growth agent, the gateway falls through to `claude-haiku`. If `claude-haiku` fails for a starter user, it falls through to Ollama (local). The Ollama fallback does not support tool calls — if the user's message triggered tool schemas, Ollama is skipped and the error propagates. When all backends in the tier's chain fail, the error message includes the last exception for debugging.
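The fall-through loop looks roughly like this — a sketch with hypothetical names and `dict`-shaped backend stubs, capturing the skip-tool-less-backends rule and the last-exception reporting:

```python
def complete_with_failover(chain: list[str], backends: dict, message: str, uses_tools: bool):
    """Walk the tier's backend chain in order; skip backends that can't run tools."""
    last_error = None
    for name in chain:
        backend = backends[name]
        if uses_tools and not backend["supports_tools"]:
            continue  # e.g. the Ollama fallback has no tool support
        try:
            return name, backend["call"](message)
        except Exception as exc:
            last_error = exc  # keep the last failure for the final error message
    raise RuntimeError(f"all backends failed; last error: {last_error}")
```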
Monitoring Hermes
Hermes emits Prometheus-style metrics via the `/metrics` endpoint (`agnt-backend/app/core/metrics.py`). Key counters:
- `llm_calls{backend, tier}` — total LLM completions by model and subscription tier.
- `llm_tokens{backend}` — total tokens consumed by model.
- `llm_cache_hits{backend, tier}` — responses served from cache (only pure-text completions are cached; tool-bearing calls always hit the model).
- `semantic_recall_skipped_trivial` — messages where the pgvector lookup was skipped because the input matched the trivial-message regex.
Every interaction is logged to the `interactions` table with a `meta` JSONB column containing full provenance: `soul_load_ms` (time to build the system prompt, in milliseconds), `semantic_recall_used` (boolean — whether pgvector was queried), `llm_backend` (which model responded), `llm_cache_hit` (boolean), `tools` (array of `{name}` objects for each tool invoked), `tool_exec_ms` (tool execution time), `tokens_in`, `tokens_out`, and `cost_usd`. This metadata powers the ops dashboard and cost attribution reports. The `AgentResponse` returned to callers also carries `tokens_in`, `tokens_out`, `cost_usd`, and `tool_calls_made` for real-time visibility.
The alerting module (`agnt-backend/app/core/alerting.py`) sends Slack/Discord webhook alerts on critical failures. Soul cache invalidation failures trigger an alert after 3 retries with the user ID and TTL exposure window. Channel circuit breaker openings log at warning level with the platform name and reset duration. The `kill_switch()` function logs at warning level when activated. The `MetricsMiddleware` records per-endpoint `request_count`, `request_errors`, and `request_latency_sum` with method + path labels for all non-metrics endpoints.
Why this matters
Hermes is the reason AGNT can serve 16 languages across 3 messaging platforms with sub-3-second response times. Without it, every agent would need its own message handling, model selection, credit checking, memory loading, and channel delivery — duplicating hundreds of lines of safety-critical code across every entry point.
The centralized routing pattern also makes cost control tractable. Token budgets (daily per-user, monthly per-user, global platform) are enforced at a single chokepoint in `llm_gateway.complete`. Tool gating (filtering the 17-tool schema list based on message intent) eliminated ~2,100 tokens of input waste per trivial message. The LLM response cache (keyed by backend + system prompt + messages, tool-free calls only) avoids redundant model calls entirely for repeated patterns. Credit refunds on LLM failure ensure users are never charged for errors.
For builders integrating with AGNT's fleet: Hermes routing decisions map directly onto heartbeat events. Every classified intent, chosen model, tool invocation, and response delivery becomes a traceable event in AGNT's supervisor UI — replayable for debugging. The runtime supervisor can adopt Hermes as a routing adapter — the supervisor handles singleton leader election and stuck-worker restarts, while Hermes decides which model handles each message. The fleet spend cap and backpressure semaphore wrap the entire routing decision: Hermes picks; AGNT enforces the budget envelope.