
Backend · Core modules

LLM gateway.

Every LLM call in AGNT goes through this one file. It abstracts Anthropic Claude (with an Ollama fallback for offline dev), enforces per-user daily token budgets, limits concurrency with a global semaphore, and caches prompt-response pairs to avoid redundant calls. File: agnt-backend/app/core/llm_gateway.py.

Why a gateway

Calling Anthropic directly from every tool and every handler would scatter rate limits, retry logic, and token accounting across the codebase. The gateway centralises all of that so there's exactly one place that speaks to the model. Swap the provider and nothing else changes.

Shared client

The Anthropic client is lazily created on first call and held as a module singleton. A global asyncio semaphore caps concurrent calls at 30 so the backend never overwhelms the Anthropic API with bursty webhook traffic.

app/core/llm_gateway.py

```python
import asyncio

import anthropic

from app.core.config import settings  # config import path assumed from project layout

_LLM_SEMAPHORE = asyncio.Semaphore(30)  # Max 30 concurrent LLM calls
_anthropic_client: anthropic.AsyncAnthropic | None = None


def _get_anthropic_client() -> anthropic.AsyncAnthropic:
    global _anthropic_client
    if _anthropic_client is None:
        _anthropic_client = anthropic.AsyncAnthropic(api_key=settings.ANTHROPIC_API_KEY)
    return _anthropic_client
```

Token budgets

Every user belongs to a tier. Each tier has a daily token cap tracked in Redis with a 24-hour TTL. Before any LLM call, the gateway checks the budget. If the budget check itself fails (Redis unreachable), the gateway fails closed, denying the request rather than risking a runaway bill.

Daily token limits

| Tier | Daily tokens | Audience |
| --- | --- | --- |
| free | 50,000 | Anyone who just created an account |
| starter | 200,000 | Consumer Starter subscription |
| pro | 1,000,000 | Consumer Pro subscription |
| venue_starter | 200,000 | Venue Starter tier |
| venue_growth | 500,000 | Venue Growth tier |
| venue_pro | 1,000,000 | Venue Pro tier |
app/core/llm_gateway.py — check_token_budget

```python
import logging

logger = logging.getLogger(__name__)

_DAILY_TOKEN_LIMITS = {
    "free": 50_000,
    "starter": 200_000,
    "pro": 1_000_000,
    "venue_starter": 200_000,
    "venue_growth": 500_000,
    "venue_pro": 1_000_000,
}


async def check_token_budget(user_id: str, tier: str) -> bool:
    """Check if user is within their daily token budget. Returns True if allowed."""
    try:
        from app.core.redis import get_redis
        r = await get_redis()
        key = f"token_budget:{user_id}"
        used = await r.get(key)
        limit = _DAILY_TOKEN_LIMITS.get(tier, _DAILY_TOKEN_LIMITS["free"])
        if used and int(used) >= limit:
            return False
        return True
    except Exception:
        logger.warning("Token budget check failed (Redis unavailable), denying request")
        return False  # Fail closed
```

Usage recording

Every completion updates both a daily counter (24-hour TTL) and a monthly counter (35-day TTL). The monthly counter feeds an optional hard stop controlled by LLM_MONTHLY_TOKEN_BUDGET, which is useful for testing promo periods or rate-capping a specific deployment.

app/core/llm_gateway.py — record_token_usage

```python
async def record_token_usage(user_id: str, tokens: int) -> None:
    """Record token usage for daily + monthly budget tracking."""
    from datetime import datetime, timezone
    from app.core.redis import get_redis
    r = await get_redis()
    month = datetime.now(timezone.utc).strftime("%Y%m")
    pipe = r.pipeline()
    pipe.incrby(f"token_budget:{user_id}", tokens)
    pipe.expire(f"token_budget:{user_id}", 86400)
    pipe.incrby(f"token_budget_month:{month}:{user_id}", tokens)
    pipe.expire(f"token_budget_month:{month}:{user_id}", 35 * 86400)
    await pipe.execute()
```

Prompt caching

The gateway wraps calls in a prompt-response cache (app/core/llm_cache.py). Keys are built from the normalised prompt + the tools list + the model name, so any two calls with identical inputs return the cached result instead of hitting Anthropic again. TTL is conservative because tool output changes frequently, but the savings on classification and short-reply paths are substantial.
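A cache key built this way might look like the following sketch; the whitespace normalisation and choice of hash are assumptions, and the real llm_cache.py may differ:

```python
import hashlib
import json


def cache_key(prompt: str, tools: list[dict], model: str) -> str:
    """Stable key from normalised prompt + tools list + model name.

    Identical inputs always hash to the same key, so repeated calls hit
    the cache instead of the provider.
    """
    normalised = " ".join(prompt.split())  # collapse whitespace runs
    payload = json.dumps(
        {"prompt": normalised, "tools": tools, "model": model},
        sort_keys=True,  # dict ordering must not change the key
    )
    return "llm_cache:" + hashlib.sha256(payload.encode()).hexdigest()
```

Serialising with `sort_keys=True` before hashing keeps the key deterministic even if callers build the payload dicts in different orders.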

Ollama fallback

For offline development, set OLLAMA_HOST and OLLAMA_MODEL in your .env. The gateway will prefer Anthropic when ANTHROPIC_API_KEY is present and fall back to a local Ollama endpoint otherwise. This is purely a dev convenience: production always runs Claude.

Tool schemas

Every LLM call is invoked with the full tool schema list from app/core/tools/__init__.py. The model decides which tool to call; the tool executor runs it with a 5-second timeout. Tool schemas and tool functions live in separate modules so the LLM's visible surface stays in sync with the registered executor.
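The 5-second cap is naturally expressed with `asyncio.wait_for`. A minimal sketch of such an executor; the `run_tool` name and the error shape returned on timeout are illustrative, not the gateway's exact contract:

```python
import asyncio


async def run_tool(tool_fn, args: dict, timeout: float = 5.0):
    """Execute a registered tool coroutine under the executor's timeout.

    Returns the tool's result, or an error payload if the tool overruns.
    """
    try:
        return await asyncio.wait_for(tool_fn(**args), timeout=timeout)
    except asyncio.TimeoutError:
        return {"error": "tool timed out"}
```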

Failure modes

  • Budget denied: the user gets a polite tier-upgrade message; no LLM call is made.
  • Anthropic 429 / timeout: the gateway retries once with exponential backoff, then surfaces a user-visible error.
  • Redis unavailable during budget check: fails closed (denies the request); every other Redis path fails open.
  • Ollama fallback unreachable: logs a warning and returns a short canned error string to the caller.
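The retry-once behaviour can be sketched as follows; the helper name and the decision to treat any exception as retryable are simplifications (the real gateway would match 429s and timeouts specifically):

```python
import asyncio


async def call_with_retry(make_call, base_delay: float = 1.0):
    """Retry a failed call exactly once after a backoff delay.

    `make_call` is a zero-arg coroutine factory. A second failure
    propagates to the caller, which surfaces the user-visible error.
    """
    try:
        return await make_call()
    except Exception:
        await asyncio.sleep(base_delay)  # single backoff step before retrying
        return await make_call()
```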

Related