## Why a gateway
Calling Anthropic directly from every tool and every handler would scatter rate limits, retry logic, and token accounting across the codebase. The gateway centralises all of that so there's exactly one place that speaks to the model. Swap the provider and nothing else changes.
## Shared client
The Anthropic client is lazily created on first call and held as a module singleton. A global asyncio semaphore caps concurrent calls at 30 so the backend never overwhelms the Anthropic API with bursty webhook traffic.
```python
_LLM_SEMAPHORE = asyncio.Semaphore(30)  # Max 30 concurrent LLM calls
_anthropic_client: anthropic.AsyncAnthropic | None = None


def _get_anthropic_client() -> anthropic.AsyncAnthropic:
    global _anthropic_client
    if _anthropic_client is None:
        _anthropic_client = anthropic.AsyncAnthropic(api_key=settings.ANTHROPIC_API_KEY)
    return _anthropic_client
```

## Token budgets
Every user belongs to a tier. Each tier has a daily token cap tracked in Redis with a 24-hour TTL. Before any LLM call, the gateway checks the budget. If the budget check itself fails (Redis unreachable), the gateway fails closed, denying the request rather than risking a runaway bill.
### Daily token limits
| Tier | Daily tokens | Audience |
|---|---|---|
| free | 50,000 | Anyone who just created an account |
| starter | 200,000 | Consumer Starter subscription |
| pro | 1,000,000 | Consumer Pro subscription |
| venue_starter | 200,000 | Venue Starter tier |
| venue_growth | 500,000 | Venue Growth tier |
| venue_pro | 1,000,000 | Venue Pro tier |
```python
_DAILY_TOKEN_LIMITS = {
    "free": 50_000,
    "starter": 200_000,
    "pro": 1_000_000,
    "venue_starter": 200_000,
    "venue_growth": 500_000,
    "venue_pro": 1_000_000,
}
```
```python
async def check_token_budget(user_id: str, tier: str) -> bool:
    """Check if user is within their daily token budget. Returns True if allowed."""
    try:
        from app.core.redis import get_redis

        r = await get_redis()
        key = f"token_budget:{user_id}"
        used = await r.get(key)
        limit = _DAILY_TOKEN_LIMITS.get(tier, _DAILY_TOKEN_LIMITS["free"])
        if used and int(used) >= limit:
            return False
        return True
    except Exception:
        logger.warning("Token budget check failed (Redis unavailable), denying request")
        return False  # Fail closed
```

## Usage recording
Every completion updates both a daily counter (24-hour TTL) and a monthly counter (35-day TTL). The monthly counter feeds an optional hard stop controlled by `LLM_MONTHLY_TOKEN_BUDGET`, useful for testing promo periods or rate-capping a specific deployment.
```python
async def record_token_usage(user_id: str, tokens: int) -> None:
    """Record token usage for daily + monthly budget tracking."""
    from datetime import datetime, timezone

    from app.core.redis import get_redis

    r = await get_redis()
    month = datetime.now(timezone.utc).strftime("%Y%m")
    pipe = r.pipeline()
    pipe.incrby(f"token_budget:{user_id}", tokens)
    pipe.expire(f"token_budget:{user_id}", 86400)
    pipe.incrby(f"token_budget_month:{month}:{user_id}", tokens)
    pipe.expire(f"token_budget_month:{month}:{user_id}", 35 * 86400)
    await pipe.execute()
```

## Prompt caching
The gateway wraps calls in a prompt-response cache (app/core/llm_cache.py). Keys are built from the normalised prompt + the tools list + the model name, so any two calls with identical inputs return the cached result instead of hitting Anthropic again. TTL is conservative because tool output changes frequently, but the savings on classification and short-reply paths are substantial.
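The key construction described above can be sketched as follows. This is illustrative only — the real logic lives in `app/core/llm_cache.py` and is not shown here; the function name, normalisation rule, and key prefix are assumptions.

```python
import hashlib
import json


def cache_key(prompt: str, tools: list[dict], model: str) -> str:
    """Hypothetical sketch of a prompt-cache key: normalise whitespace and
    case in the prompt, then hash it together with the tool schemas and
    model name so identical (prompt, tools, model) triples collide on
    purpose and return the cached completion."""
    normalised = " ".join(prompt.split()).lower()
    payload = json.dumps(
        {"prompt": normalised, "tools": tools, "model": model},
        sort_keys=True,  # stable serialisation so key order never changes the hash
    )
    return "llm_cache:" + hashlib.sha256(payload.encode()).hexdigest()
```

Hashing a canonical JSON payload (rather than concatenating strings) avoids accidental key collisions between, say, a prompt that ends with the model name and a shorter prompt paired with that model.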
## Ollama fallback
For offline development, set `OLLAMA_HOST` and `OLLAMA_MODEL` in your `.env`. The gateway prefers Anthropic when `ANTHROPIC_API_KEY` is present and falls back to a local Ollama endpoint otherwise. This is purely a dev convenience; production always runs Claude.
## Tool schemas
Every LLM call is invoked with the full tool schema list from `app/core/tools/__init__.py`. The model decides which tool to call; the tool executor runs it with a 5-second timeout. Tool schemas and tool functions live in separate modules so the LLM's visible surface stays in sync with the registered executor.
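The 5-second timeout guard can be sketched with `asyncio.wait_for`. This is a minimal illustration, not the real executor — the function name, error shape, and registry lookup are assumptions:

```python
import asyncio


async def run_tool(tool_fn, args: dict, timeout: float = 5.0):
    """Hypothetical sketch of the executor's timeout guard: each tool call
    is wrapped in asyncio.wait_for so a slow or hung tool cannot stall
    the reply; on timeout the model gets an error payload instead."""
    try:
        return await asyncio.wait_for(tool_fn(**args), timeout=timeout)
    except asyncio.TimeoutError:
        # Return a structured error so the model can apologise or retry
        return {"error": f"tool timed out after {timeout}s"}
```

Returning an error payload (rather than letting the exception propagate) lets the gateway hand the failure back to the model as a tool result, so the conversation degrades gracefully.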
## Failure modes
- Budget denied: the user gets a polite tier-upgrade message. No LLM call is made.
- Anthropic 429 / timeout: the gateway retries once with exponential backoff, then surfaces a user-visible error.
- Redis unavailable during budget check: fails closed (denies the request). Every other Redis path fails open.
- Ollama fallback unreachable: logs a warning and returns a short canned error string to the caller.