Agent sonnet

SRE

Site Reliability Engineer ensuring systems are reliable, observable, and recoverable in production.

Overview

The SRE ensures that what you build survives production. It designs observability, defines error budgets, plans for failure, and makes sure that when things break (and they will) recovery is fast and well documented. Its core question: "Will this survive production?"

When It's Used

Invoked during /dr-design for reliability requirements, /dr-qa for load and resilience review, and /dr-archive Step 0.5 for postmortem analysis. In Consilium, it speaks as the voice of reliability.

Capabilities

SLO/SLA definition — establishes service level objectives and error budget management
Observability design — metrics (RED, USE, 4 golden signals), structured logging, distributed tracing
Alerting strategy — defines what to page on, what to log, what to ignore to reduce alert fatigue
Capacity planning — scaling assessment and resource forecasting
Incident response — runbooks, escalation paths, communication templates
Chaos engineering — "what if this service dies? what if this dependency is slow?"
Postmortem facilitation — blameless failure analysis with actionable follow-ups
Graceful degradation — circuit breakers, bulkheads, retry with backoff, fallbacks
Deployment safety — canary releases, feature flags, rollback procedures
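
As an illustration of the graceful degradation item above, here is a minimal circuit breaker sketch with a fallback. It is not taken from the SRE agent itself: the class name, the thresholds, and the injectable clock are all assumptions chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the behavior can be tested without sleeping
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of hitting the unhealthy dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the dependency, for example `breaker.call(fetch_price, order_id, fallback=lambda: cached_price)`, where `fetch_price` and `cached_price` are placeholders for the real call and its degraded answer.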

How It Works

The SRE reads the task definition, system patterns, and tech context, then evaluates the system through a reliability lens. During design, it defines SLOs and identifies failure modes. During QA, it stress-tests the architecture mentally: what happens when a dependency is down, when traffic spikes 10x, or when a deploy goes wrong? After incidents, it facilitates blameless postmortems with concrete follow-up actions.
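
The "traffic spikes 10x" question can be made concrete with a rough headroom estimate. The sketch below is only an illustration: the function name, the 70% target utilization, and the sample numbers are assumptions, not values the SRE agent prescribes.

```python
import math


def instances_needed(peak_rps, per_instance_rps, spike_factor=10, target_utilization=0.7):
    """Instances required to absorb a spike while staying under the target utilization.

    All parameters are hypothetical inputs a reviewer would supply from load tests.
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    required = (peak_rps * spike_factor) / (per_instance_rps * target_utilization)
    return math.ceil(required)


# Example: 200 req/s at peak, each instance sustains 500 req/s in load tests.
print(instances_needed(peak_rps=200, per_instance_rps=500))  # prints 6
```

If the current deployment runs fewer instances than this estimate and cannot scale out quickly enough, that gap is the kind of finding a resilience review would record.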

Example

/dr-design "Deploy new API service"
→ SRE defines SLOs: 99.9% availability, p99 latency < 200ms
→ Error budget: about 43 minutes of downtime per 30-day month (0.1% of 43,200 minutes)
→ Observability: RED metrics + structured JSON logging
→ Alerting: page on error rate > 5%, log on p99 > 500ms
→ Degradation: circuit breaker on payment service dependency
→ Runbook: 3-step rollback procedure documented
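
The error budget and alert thresholds in this example can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a 30-day window. The function names are hypothetical, and the thresholds are copied from the example output above rather than from any fixed SRE agent configuration.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * window_days * 24 * 60


def classify(error_rate, p99_ms, page_error_rate=0.05, log_p99_ms=500):
    """Map current metrics to an action: page on errors, log on slow p99, else ignore."""
    if error_rate > page_error_rate:
        return "page"
    if p99_ms > log_p99_ms:
        return "log"
    return "ignore"


print(round(error_budget_minutes(0.999), 1))   # prints 43.2
print(classify(error_rate=0.06, p99_ms=180))   # prints page
print(classify(error_rate=0.01, p99_ms=650))   # prints log
```

Checking the error rate first means a service that is both failing and slow pages once rather than paging and logging, which fits the goal of reducing alert fatigue.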

Context Loading

Reads datarim/tasks.md, datarim/systemPatterns.md, and datarim/techContext.md. Applies datarim-system and performance skills on every invocation. Loads security skill for security-related reliability concerns.

Skills Used

datarim-system (always), performance (always), security (when needed).
