I write infrastructure runbooks, incident response playbooks, and operational procedures for DevOps, SRE, and platform engineering teams — designed to be used at 3am by an on-call engineer who needs fast, unambiguous steps, not narrative prose.
Runbook scope: service degradation and outage response; database failure and recovery (Postgres, MySQL, Redis); Kubernetes node pressure, pod crash loops, and OOMKill events; certificate expiry and rotation; deployment failure and rollback; container registry and build pipeline failures; DNS resolution failures; and scheduled maintenance procedures.
Incident response playbooks: severity classification criteria; escalation path and notification tree; communication templates (internal Slack, external status page, customer communication); post-mortem structure and action item tracking; and on-call handoff procedures.
Format: decision-tree structure with exact commands, expected outputs, and conditional branches ('if X, proceed to step 5; if Y, escalate to DBA'). Every runbook ends with a verification step and escalation path.
My infrastructure background means the failure scenarios I document are ones I've debugged myself — not generic templates.