Operations · Advanced

Fleet Operations Playbook

Monitor, diagnose, and maintain your agent fleet — health checks, job audits, cost monitoring, and incident response.

AGNT Operations Desk · 4 steps · 60-90 minutes · Claude Opus 4.6, Claude Sonnet 4.6

Why this playbook

An agent fleet is a production system — it needs monitoring, maintenance, and incident response just like any other service. The difference is that agent failures are often subtle: a stuck heartbeat, a silently failing cron job, or a gradual cost increase that doesn't trigger alerts until the bill arrives.

This playbook chains four operations prompts that cover the full ops lifecycle: check fleet health, audit scheduled jobs, monitor spend, and write incident reports when things go wrong. Run the first three weekly as preventive maintenance.

Built from AGNT's own internal ops runbook. These are the exact prompts our Sentinel agent uses to monitor the fleet.

Prerequisites

  • Claude Code with filesystem access to fleet logs
  • SSH access to the deployment environment (Railway, VPS, or similar)
  • Access to the scheduler's job list (APScheduler config or crontab)
  • Billing API access or cost dashboard credentials

Input requirements

  • Fleet log directory (directory path, required): path to heartbeat logs, error logs, and agent activity logs. Typically /var/log/agnt/ or the Railway log stream.
  • Scheduler config path (file path, required): path to the APScheduler job configuration or crontab. Used to audit job health and detect stuck or orphaned jobs.
  • Billing timeframe (string, optional): period to analyze for spend anomalies. Default: last 7 days. Format: '7d', '30d', or an ISO date range.

Step-by-step workflow

Step 1: Review fleet health and heartbeat status

Start with the big picture. This prompt analyzes heartbeat logs to identify stuck agents, degraded performance, and communication failures. It produces a fleet health matrix: green/yellow/red per agent.

Pay special attention to yellow-status agents — they're still running but showing early signs of degradation (increasing latency, higher error rates, or missed heartbeats).

Step 2: Audit scheduled jobs for failures and drift

Scheduled jobs are the backbone of fleet automation — if they drift or fail silently, your agents lose critical maintenance routines. This prompt audits every scheduled job: last run time, success rate, execution duration, and schedule drift.

The output flags jobs that have silently stopped, jobs running significantly longer than their historical average, and jobs whose schedules have drifted from their intended cadence.

Inputs: fleet health matrix from step 1 (to correlate agent issues with job failures)
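Schedule drift can be measured by comparing the actual gaps between runs against the intended cadence. A minimal sketch, assuming you can extract a sorted list of run start times from the scheduler's history (the record shape and tolerance are illustrative):

```python
from datetime import datetime, timedelta

def audit_job(runs: list[datetime], intended_interval: timedelta,
              tolerance: float = 0.25) -> dict:
    """Flag schedule drift for one job.

    `runs` is a sorted list of run start times; a gap more than `tolerance`
    (as a fraction) longer than the intended interval counts as drifted.
    """
    gaps = [later - earlier for earlier, later in zip(runs, runs[1:])]
    limit = intended_interval * (1 + tolerance)
    drifted = [g for g in gaps if g > limit]
    return {
        "last_run": runs[-1] if runs else None,
        "intervals": len(gaps),
        "drifted": len(drifted),
        "max_gap": max(gaps, default=None),
    }
```

A job with zero runs in the window (empty `runs`) is the "silently stopped" case; a growing `max_gap` on a job that still reports success is the drift case the audit flags.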
Step 3: Monitor and detect cost anomalies

This prompt analyzes your billing data to detect unusual spend patterns: sudden spikes, gradual creep, and per-agent cost outliers. It compares current spend against a rolling 30-day baseline.

Cost anomalies often signal operational issues before health checks catch them — a stuck retry loop, an agent making redundant API calls, or an unexpected traffic spike hitting your LLM budget.

Inputs: agent list from step 1; job schedule from step 2
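One way to implement the rolling-baseline comparison is a trailing-window z-score over daily spend totals. A minimal sketch, assuming a flat list of daily spend figures; the 30-day window matches the playbook, but the threshold is illustrative:

```python
from statistics import mean, stdev

def flag_spend_anomalies(daily_spend: list[float], window: int = 30,
                         z_threshold: float = 3.0) -> list[int]:
    """Return indices of days whose spend is anomalous vs. the trailing window.

    Each day is compared against the mean and standard deviation of the
    preceding `window` days; a z-score above `z_threshold` flags the day.
    """
    flagged = []
    for i in range(window, len(daily_spend)):
        baseline = daily_spend[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and (daily_spend[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged
```

A spike threshold catches stuck retry loops quickly; gradual creep shows up instead as the baseline mean drifting upward, which is why the report also includes trend comparison, not just anomaly flags.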
Step 4: Document incidents and generate postmortems

If any of the previous steps found critical issues, this prompt generates a structured incident report: timeline, root cause, impact, resolution, and follow-up actions.

Even if nothing is broken, run this step to document the review itself — it creates a 'weekly ops review' note that feeds into your knowledge base and helps track fleet health trends over time.

Inputs: health matrix from step 1; job audit from step 2; spend report from step 3
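The report structure can be scaffolded programmatically so every incident (or weekly review) uses the same sections. A minimal sketch covering the fields named above plus a summary; the function and section names are illustrative, not the prompt's actual output format:

```python
from datetime import date

# Sections from the playbook's report structure, plus a leading summary.
SECTIONS = ["Summary", "Timeline", "Root cause", "Impact",
            "Resolution", "Follow-up actions"]

def incident_report(title: str, severity: str, day: date) -> str:
    """Render an empty postmortem skeleton in Markdown."""
    header = f"# Incident: {title}\nDate: {day.isoformat()} | Severity: {severity}\n"
    body = "\n".join(f"## {section}\n_TBD_\n" for section in SECTIONS)
    return header + "\n" + body
```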

Expected outputs

  • Fleet health matrix (Markdown + JSON; produced by step 1): per-agent health status (green/yellow/red) with latency, error rate, and heartbeat metrics.
  • Job audit report (Markdown; produced by step 2): status of every scheduled job, including last run, success rate, duration trend, and schedule drift analysis.
  • Spend anomaly report (Markdown + JSON; produced by step 3): cost analysis with per-agent breakdown, anomaly flags, and trend comparison against the 30-day baseline.
  • Incident report / ops review (Markdown; produced by step 4): structured incident report or weekly ops review with timeline, root cause, and follow-up actions.

Tool requirements

  • Claude Code with filesystem + SSH access
  • Access to fleet scheduler (APScheduler / cron)
  • Access to spend/billing dashboard or API

Troubleshooting

Heartbeat logs are empty or missing
Check that the heartbeat cron job is running (step 2 will also catch this). If using Railway, logs may be in a different stream — check `railway logs` with the correct service filter. If self-hosted, verify the log rotation config hasn't archived today's logs.
Spend API returns 403 Forbidden
Billing API access requires admin-level credentials, not the standard API key. Check your role permissions in the provider dashboard. For Railway, use the team-level API token, not the project token.
Cron doctor reports all jobs healthy but fleet is degraded
Jobs may be running on schedule but producing errors that don't affect the job's exit status. Cross-reference job output logs (not just cron logs) with the health matrix from step 1. The most common culprit: a job that 'succeeds' but its downstream effect fails.



Run the playbook: open each prompt in order, feed each step's output forward, and ship the workflow end-to-end.