Automation fails in predictable ways when it starts as code instead of a process. The fix is simple: write the runbook first, then turn the safest steps into automation.
This guide shows how to go from an incident checklist you can hand to a human at 2 a.m. to a safe, repeatable automation that does not create new outages.
Who this is for
Platform engineers, SREs, cloud ops, and FinOps leads who are tired of brittle scripts and one-off fixes. The approach works for Azure and beyond; the examples lean toward Azure because that is where most people feel the pain.
Why runbook-first beats script-first
· A runbook captures the decision points. Automation without decision points is just a faster way to be wrong.
· Incidents expose what is actually repeatable. Backlogs often do not.
· A checklist forces guardrails: prerequisites, safe actions, rollback, and evidence.
· Once you have a runbook, automation is mostly plumbing.
The runbook-first loop
Use this loop for each incident type you want to automate.
1. Pick one high-frequency incident with low blast radius.
2. Write the runbook checklist as if a new on-call engineer will use it.
3. Mark each step as: Gather, Decide, Act, Validate, Communicate, or Close.
4. Automate Gather first, then Validate. Automate Act only when it is safe and reversible.
5. Ship with guardrails: scoped permissions, approvals, rate limits, and an escape hatch.
6. Review after the next incident and tighten the runbook.
Operator rule
If you cannot explain the automated action in one sentence and show how to roll it back, it is not ready for unattended execution.
Prerequisites
· A simple incident taxonomy (what counts as Sev0, Sev1, Sev2).
· An alert source (Azure Monitor, Prometheus, Datadog, whatever).
· A place to run automation (Azure Automation, Functions, Logic Apps, GitHub Actions, Jenkins).
· A ticketing or audit trail (ServiceNow, Jira, Azure DevOps) or at least a log workspace.
· One owner for the runbook and one backup. Shared ownership means no ownership.
Step 1: Build the incident checklist
Do not start with tools. Start with a one-page checklist. Keep it boring. Boring is stable.
Checklist skeleton (copy and paste)
RUNBOOK: <Incident Name>
Service / workload: <name>
Owner: <team> Backup: <team/person>
Last tested: <date> Review cadence: <monthly/quarterly>
1) Trigger and scope
- Alert name and threshold:
- What does "bad" look like?
- Known false positives:
- Scope guard: subscription/resource group/cluster/app:
2) Quick triage (first 5 minutes)
- Confirm impact (user-facing? internal?):
- Check current change window / deployments:
- Capture evidence (screenshots/log query links):
- Identify current on-call / incident commander:
3) Gather signals (no changes yet)
- Health/availability:
- CPU/memory/disk:
- Network:
- Recent config changes:
- Cost anomaly signals (optional):
4) Decide
- What is the likely limiter?
- What is safe to do right now vs must wait for approval?
5) Safe actions (reversible first)
- Action A (reversible): <what> Rollback: <how>
- Action B (reversible): <what> Rollback: <how>
- Action C (irreversible): <what> Approval required: <who>
6) Validate
- What "good" looks like:
- Success criteria and time window:
- If not improved, next escalation:
7) Communicate
- Status update template:
- Stakeholders to notify:
- ETA language:
8) Close
- Ticket updates:
- Post-incident note:
- Follow-up work items:
Step 2: Tag each runbook step for automation readiness
Take the checklist and tag each line. This is how you decide what to automate first.
· Gather: read-only collection of context (safe to automate early).
· Decide: human judgment or a clear decision tree (automate only when the decision is deterministic).
· Act: a change (automate only if scoped, reversible, and tested).
· Validate: confirm success (safe to automate early).
· Communicate: update ticket or chat (safe to automate early).
· Close: final state and evidence (safe to automate once the rest is reliable).
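If it helps to make the tagging concrete, here is a minimal sketch in Python. The names (StepKind, RunbookStep, automation_candidates) are illustrative, not part of any platform; the point is that the tags alone tell you what to automate first.

```python
from dataclasses import dataclass
from enum import Enum

class StepKind(Enum):
    GATHER = "gather"
    DECIDE = "decide"
    ACT = "act"
    VALIDATE = "validate"
    COMMUNICATE = "communicate"
    CLOSE = "close"

# Kinds that are safe to automate before any change is automated.
SAFE_EARLY = {StepKind.GATHER, StepKind.VALIDATE, StepKind.COMMUNICATE}

@dataclass
class RunbookStep:
    description: str
    kind: StepKind
    reversible: bool = False  # only meaningful for ACT steps

def automation_candidates(steps):
    """Split steps into (automate now, automate later with guardrails)."""
    early = [s for s in steps if s.kind in SAFE_EARLY]
    guarded = [s for s in steps if s.kind is StepKind.ACT and s.reversible]
    return early, guarded
```

Irreversible Act steps fall through both buckets on purpose: they stay human-led.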
Automation ladder (safe ordering)
· Level 0: Copy and paste commands inside the runbook. Human runs them.
· Level 1: One-click automation that only gathers data and posts a summary.
· Level 2: Guarded remediation that requires an approval or a ticket id.
· Level 3: Auto-remediation for only the simplest, reversible fixes with tight scope and rate limits.
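The ladder is easy to enforce in code. A small sketch (the Ladder enum and may_change helper are assumed names for this illustration): changes are only permitted at Level 2 with an approval, or at Level 3.

```python
from enum import IntEnum

class Ladder(IntEnum):
    MANUAL = 0        # Level 0: human copies commands from the runbook
    GATHER_ONLY = 1   # Level 1: one-click gather-and-summarize
    GUARDED = 2       # Level 2: remediation gated on approval / ticket id
    AUTO = 3          # Level 3: unattended, reversible, tightly scoped

def may_change(level: Ladder, has_approval: bool) -> bool:
    """A change is allowed only at GUARDED (with approval) or at AUTO."""
    if level is Ladder.AUTO:
        return True
    return level is Ladder.GUARDED and has_approval
```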
Step 3: Build the automation wrapper
No matter which platform you use, the wrapper is the same.
1. Input validation: resource id, environment, severity, ticket id.
2. Scope guard: allow-list subscriptions/resource groups/tags, deny everything else.
3. Read-only gather: metrics, logs, deployment history, config drift signals.
4. Decision: either a deterministic rule or a human approval step.
5. Action: run only one safe change per execution.
6. Validation: re-check the signal that triggered the incident.
7. Audit: write structured logs and update the ticket.
8. Escape hatch: a kill switch and a manual rollback path.
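The wrapper can be sketched platform-agnostically. In this Python sketch the stage callables (gather, decide, act, validate, audit) are supplied by the caller and are hypothetical names; the function only enforces the ordering and the guardrails, not any platform specifics.

```python
def run_automation(resource_id, allow_list, *, gather, decide, act, validate, audit):
    """One execution of the wrapper: scope guard, gather, decide, one action,
    validate, audit. Returns a short outcome string for the ticket."""
    # Scope guard: deny anything not explicitly allow-listed.
    if resource_id not in allow_list:
        audit("denied", resource_id)
        return "denied"
    context = gather(resource_id)        # read-only context pack
    if not decide(context):              # deterministic rule or human approval
        audit("no-action", resource_id)
        return "no-action"
    act(resource_id)                     # exactly one safe change per run
    ok = validate(resource_id)           # re-check the triggering signal
    outcome = "remediated" if ok else "validation-failed"
    audit(outcome, resource_id)
    return outcome
```

On a real platform each callable wraps an API call (a KQL query, an approval gate, a restart); the skeleton stays the same.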
Example: Azure Monitor alert -> runbook summary -> optional remediation
This example pattern works well in Azure because it keeps the first version safe.
· Alert fires (metric or log).
· Automation runs in read-only mode and posts a runbook summary to the ticket or Teams.
· If a human approves, the automation performs one bounded action and validates.
Copy and paste: KQL to capture incident context
// Example: quick context pack (edit table names to match your workspace)
// 1) What changed recently?
AzureActivity
| where TimeGenerated > ago(6h)
| where ActivityStatusValue == "Success"
| project TimeGenerated, Caller, OperationNameValue, ResourceGroup, ResourceId, ActivitySubstatusValue
| top 50 by TimeGenerated desc
// 2) Heartbeat / agent health (for VMs with AMA/Log Analytics)
Heartbeat
| where TimeGenerated > ago(30m)
| summarize LastSeen=max(TimeGenerated) by Computer
// 3) Common platform errors
AppTraces
| where TimeGenerated > ago(30m)
| where SeverityLevel >= 3
| top 100 by TimeGenerated desc
Pro Tip:
Option A (most common): Export Activity Log to the workspace
In the Microsoft Azure portal:
Go to Monitor → Activity log (pick the subscription).
Diagnostic settings → Add diagnostic setting
Check the categories you want (typically Administrative, Policy, Security, etc.)
Send to Log Analytics workspace → select your workspace → Save
After that, AzureActivity will populate in that workspace (may take a bit).
Option B: Run the query in the correct scope
If you open Logs from a Subscription or Activity Log context (instead of a workspace-only context), AzureActivity may be available without exporting. Whether it appears depends on where you launched Logs from and the scope you selected.
Copy and paste: Azure CLI guardrail checks
# Confirm scope and identity
az account show --query "{subscription:id, tenant:tenantId, user:user.name}" -o jsonc
# Check resource tags (use tags as an allow/deny control)
az resource show --ids <RESOURCE_ID> --query "tags" -o jsonc
# Check recent deployments in a resource group
az deployment group list -g <RG_NAME> --query "[0:5].[name,properties.timestamp,properties.provisioningState]" -o table
Step 4: Make remediation safe
Most auto-remediation disasters come from missing guardrails, not from bad intent. Use these.
· Least privilege: split identities for read-only vs change.
· Scoped execution: allow-list resources by tag, naming, or explicit ids.
· Rate limits: cap executions per hour per workload.
· Idempotency: repeated runs should not keep changing things.
· One action per run: do not chain multiple changes.
· Proof of safety: each action must have a rollback and a test case.
· Human approval for anything that changes capacity, networking, or identity.
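The rate-limit guardrail is worth spelling out, because it is the one most teams skip. A minimal in-memory sketch (class name assumed; once more than one worker can run the automation, the run history needs a shared store):

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Cap executions per workload inside a rolling window."""
    def __init__(self, max_per_window=3, window_s=3600):
        self.max = max_per_window
        self.window_s = window_s
        self._runs = defaultdict(deque)  # workload -> timestamps of past runs

    def allow(self, workload, now=None):
        now = time.time() if now is None else now
        runs = self._runs[workload]
        # Drop runs that have aged out of the window.
        while runs and now - runs[0] >= self.window_s:
            runs.popleft()
        if len(runs) >= self.max:
            return False  # over budget: stop and page a human instead
        runs.append(now)
        return True
```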
Step 5: Validate like you mean it
Validation is not 'script returned exit code 0'. It is 'the incident signal is no longer bad'.
· Re-check the metric or log query that triggered the alert.
· Confirm user impact is improving (synthetic test, health probe, p95 latency).
· Capture evidence in the ticket: before/after values and timestamps.
· If validation fails, stop and escalate. Do not try more fixes automatically.
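The validation rule above can be sketched in a few lines. Here read_signal stands in for whatever metric or log query fired the alert (a hypothetical callable), and success requires the signal to stay healthy for several consecutive reads, not just one:

```python
def signal_recovered(read_signal, threshold, consecutive=3):
    """True only if the triggering signal reads below `threshold`
    for `consecutive` checks in a row."""
    for _ in range(consecutive):
        if read_signal() >= threshold:
            return False  # still bad: stop here and escalate to a human
    return True
```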
Top 5 failure modes (and how to avoid them)
Automation cannot read logs or metrics
· Fix: assign the right reader role at the right scope (Log Analytics Reader, Monitoring Reader).
· Tip: start with read-only identity and prove visibility before adding write permissions.
The automation runs against the wrong resource
· Fix: implement an allow-list and require resource id validation.
· Tip: use tags like automation=allowed and environment=prod/nonprod.
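The tag check itself is tiny. In this sketch the tag names automation and environment follow the tip above; they are team conventions, not Azure built-ins:

```python
def automation_allowed(tags, expected_env):
    """Allow a change only if the resource is explicitly opted in
    and lives in the environment this run was scoped to."""
    return (
        tags.get("automation") == "allowed"
        and tags.get("environment") == expected_env
    )
```

Note the default is deny: a missing tag means no automation.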
It keeps running in a loop and makes things worse
· Fix: add rate limits and a state check (do not rerun if still inside cooldown).
· Tip: store last execution time per resource in a table or key-value store.
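A cooldown check is a one-timestamp version of that state. In-memory sketch only (class name assumed); the tip above applies, so persist the last-run times in a table or key-value store so the state survives restarts:

```python
import time

class Cooldown:
    """Skip a rerun if the last run for this resource was too recent."""
    def __init__(self, seconds=900):
        self.seconds = seconds
        self._last = {}  # resource_id -> timestamp of last allowed run

    def should_run(self, resource_id, now=None):
        now = time.time() if now is None else now
        last = self._last.get(resource_id)
        if last is not None and now - last < self.seconds:
            return False  # still inside cooldown: do not rerun
        self._last[resource_id] = now
        return True
```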
Remediation changes are irreversible
· Fix: do not automate irreversible actions. Require approval and a change window.
· Tip: if you cannot roll it back, it must be human-led.
Nobody owns the runbook
· Fix: assign a named owner and a review cadence.
· Tip: treat runbooks like code. Version, review, and test them.
What to automate first
If you want quick wins, start with these patterns:
· Context packs: gather last 6 hours of changes, health signals, and top errors into a single summary.
· Evidence capture: store before/after metrics and a timestamped log bundle.
· Ticket hygiene: create incident ticket, assign owner, post status update template.
· Guarded restarts: restart a single instance behind a load balancer with a health check gate.
· Temporary throttles: reduce concurrency or pause a non-critical job if cost or saturation is driving impact.
Quick start (15 minutes)
1. Pick one incident that happens at least twice a month.
2. Copy the checklist skeleton and fill out only Trigger, Quick triage, Gather signals, and Validate.
3. Run the next incident using the runbook. Fix any missing steps immediately.
4. Automate only the Gather section and have it post to your ticketing system.
5. After one week, decide whether any action step is safe enough for guarded remediation.
