Operations · Advanced

Fleet Operations Playbook

Monitor, diagnose, and maintain your agent fleet — health checks, job audits, cost monitoring, and incident response.

AGNT Operations Desk · 4 steps · 60-90 minutes · Claude Opus 4.6, Claude Sonnet 4.6

Why this playbook

An agent fleet is a production system — it needs monitoring, maintenance, and incident response just like any other service. The difference is that agent failures are often subtle: a stuck heartbeat, a silently failing cron job, or a gradual cost increase that doesn't trigger alerts until the bill arrives.

This playbook chains four operations prompts that cover the full ops lifecycle: check fleet health, audit scheduled jobs, monitor spend, and write incident reports when things go wrong. Run the first three weekly as preventive maintenance.

Built from AGNT's own internal ops runbook. These are the exact prompts our Sentinel agent uses to monitor the fleet.

Prerequisites

  • Claude Code with filesystem access to fleet logs
  • SSH access to the deployment environment (Railway, VPS, or similar)
  • Access to the scheduler's job list (APScheduler config or crontab)
  • Billing API access or cost dashboard credentials

Input requirements

  • Fleet log directory (directory path, required): path to heartbeat logs, error logs, and agent activity logs. Typically /var/log/agnt/ or the Railway log stream.
  • Scheduler config path (file path, required): path to the APScheduler job configuration or crontab. Used to audit job health and detect stuck or orphaned jobs.
  • Billing timeframe (string, optional): period to analyze for spend anomalies. Default: last 7 days. Format: '7d', '30d', or an ISO date range.

Step-by-step workflow

Step 1: Review fleet health and heartbeat status

Start with the big picture. This prompt analyzes heartbeat logs to identify stuck agents, degraded performance, and communication failures. It produces a fleet health matrix: green/yellow/red per agent.

Pay special attention to yellow-status agents — they're still running but showing early signs of degradation (increasing latency, higher error rates, or missed heartbeats).

Step 2: Audit scheduled jobs for failures and drift

Scheduled jobs are the backbone of fleet automation — if they drift or fail silently, your agents lose critical maintenance routines. This prompt audits every scheduled job: last run time, success rate, execution duration, and schedule drift.

The output flags jobs that have silently stopped, jobs running significantly longer than their historical average, and jobs whose schedules have drifted from their intended cadence.

Inputs: fleet health matrix from step 1 (to correlate agent issues with job failures)
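Schedule drift can be measured by comparing the actual gaps between runs against the intended cadence. A minimal sketch, assuming you can extract a sorted list of run start times from the scheduler's history (the record shape and tolerance are illustrative):

```python
from datetime import datetime, timedelta

def audit_job(runs: list[datetime], intended_interval: timedelta,
              tolerance: float = 0.25) -> dict:
    """Flag schedule drift for one job.

    `runs` is a sorted list of run start times; a gap more than `tolerance`
    (as a fraction) longer than the intended interval counts as drifted.
    """
    gaps = [later - earlier for earlier, later in zip(runs, runs[1:])]
    limit = intended_interval * (1 + tolerance)
    drifted = [g for g in gaps if g > limit]
    return {
        "last_run": runs[-1] if runs else None,
        "intervals": len(gaps),
        "drifted": len(drifted),
        "max_gap": max(gaps, default=None),
    }
```

A job with zero runs in the window (empty `runs`) is the "silently stopped" case; a growing `max_gap` on a job that still reports success is the drift case the audit flags.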
Step 3: Monitor and detect cost anomalies

This prompt analyzes your billing data to detect unusual spend patterns: sudden spikes, gradual creep, and per-agent cost outliers. It compares current spend against a rolling 30-day baseline.

Cost anomalies often signal operational issues before health checks catch them — a stuck retry loop, an agent making redundant API calls, or an unexpected traffic spike hitting your LLM budget.

Inputs: agent list from step 1; job schedule from step 2
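One way to implement the rolling-baseline comparison is a trailing-window z-score over daily spend totals. A minimal sketch, assuming a flat list of daily spend figures; the 30-day window matches the playbook, but the threshold is illustrative:

```python
from statistics import mean, stdev

def flag_spend_anomalies(daily_spend: list[float], window: int = 30,
                         z_threshold: float = 3.0) -> list[int]:
    """Return indices of days whose spend is anomalous vs. the trailing window.

    Each day is compared against the mean and standard deviation of the
    preceding `window` days; a z-score above `z_threshold` flags the day.
    """
    flagged = []
    for i in range(window, len(daily_spend)):
        baseline = daily_spend[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and (daily_spend[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged
```

A spike threshold catches stuck retry loops quickly; gradual creep shows up instead as the baseline mean drifting upward, which is why the report also includes trend comparison, not just anomaly flags.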
Step 4: Document incidents and generate postmortems

If any of the previous steps found critical issues, this prompt generates a structured incident report: timeline, root cause, impact, resolution, and follow-up actions.

Even if nothing is broken, run this step to document the review itself — it creates a 'weekly ops review' note that feeds into your knowledge base and helps track fleet health trends over time.

Inputs: health matrix from step 1; job audit from step 2; spend report from step 3
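The report structure can be scaffolded programmatically so every incident (or weekly review) uses the same sections. A minimal sketch covering the fields named above plus a summary; the function and section names are illustrative, not the prompt's actual output format:

```python
from datetime import date

# Sections from the playbook's report structure, plus a leading summary.
SECTIONS = ["Summary", "Timeline", "Root cause", "Impact",
            "Resolution", "Follow-up actions"]

def incident_report(title: str, severity: str, day: date) -> str:
    """Render an empty postmortem skeleton in Markdown."""
    header = f"# Incident: {title}\nDate: {day.isoformat()} | Severity: {severity}\n"
    body = "\n".join(f"## {section}\n_TBD_\n" for section in SECTIONS)
    return header + "\n" + body
```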

Expected outputs

  • Fleet health matrix (Markdown + JSON; produced by step 1): per-agent health status (green/yellow/red) with latency, error rate, and heartbeat metrics.
  • Job audit report (Markdown; produced by step 2): status of every scheduled job, including last run, success rate, duration trend, and schedule drift analysis.
  • Spend anomaly report (Markdown + JSON; produced by step 3): cost analysis with per-agent breakdown, anomaly flags, and trend comparison against the 30-day baseline.
  • Incident report / ops review (Markdown; produced by step 4): structured incident report or weekly ops review with timeline, root cause, and follow-up actions.

Tool requirements

  • Claude Code with filesystem + SSH access
  • Access to fleet scheduler (APScheduler / cron)
  • Access to spend/billing dashboard or API

Troubleshooting

Heartbeat logs are empty or missing
Check that the heartbeat cron job is running (step 2 will also catch this). If using Railway, logs may be in a different stream — check `railway logs` with the correct service filter. If self-hosted, verify the log rotation config hasn't archived today's logs.
Spend API returns 403 Forbidden
Billing API access requires admin-level credentials, not the standard API key. Check your role permissions in the provider dashboard. For Railway, use the team-level API token, not the project token.
Cron doctor reports all jobs healthy but fleet is degraded
Jobs may be running on schedule but producing errors that don't affect the job's exit status. Cross-reference job output logs (not just cron logs) with the health matrix from step 1. The most common culprit: a job that 'succeeds' but its downstream effect fails.



Run the playbook: open each prompt in order, feed each step's output forward, and ship the workflow end-to-end.