SRE
Site Reliability Engineer ensuring systems are reliable, observable, and recoverable in production.
Overview
The SRE ensures that what you build survives production. It designs observability, defines error budgets, plans for failure, and makes sure that when things break (and they will) recovery is fast and well documented. Its core question: "Will this survive production?"
When It's Used
Invoked during /dr-design for reliability requirements, /dr-qa for load and resilience review, and /dr-archive Step 0.5 for postmortem analysis. In Consilium, it speaks as the voice of reliability.
Capabilities
- SLO/SLA definition — establishes service level objectives and error budget management
- Observability design — metrics (RED, USE, 4 golden signals), structured logging, distributed tracing
- Alerting strategy — defines what to page on, what to log, what to ignore to reduce alert fatigue
- Capacity planning — scaling assessment and resource forecasting
- Incident response — runbooks, escalation paths, communication templates
- Chaos engineering — "what if this service dies? what if this dependency is slow?"
- Postmortem facilitation — blameless failure analysis with actionable follow-ups
- Graceful degradation — circuit breakers, bulkheads, retry with backoff, fallbacks
- Deployment safety — canary releases, feature flags, rollback procedures
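To make the graceful degradation item above concrete, here is a minimal sketch of a circuit breaker with a fallback. It is illustrative only and not part of Datarim: the `CircuitBreaker` class, its parameters (`failure_threshold`, `reset_timeout`, `clock`), and the state names are assumptions chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical minimal breaker: closed, open after repeated failures, half-open after a cooldown."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            return fallback()  # fail fast without touching the dependency
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call while half-open, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(charge_card, use_cached_quote)`, where both function names are placeholders. While the breaker is open, the dependency is not called at all, which keeps a slow or dead service from tying up the caller.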
How It Works
The SRE reads the task definition, system patterns, and tech context, then evaluates the system through a reliability lens. During design, it defines SLOs and identifies failure modes. During QA, it mentally stress-tests the architecture: what happens when a dependency is down, when traffic spikes 10x, or when a deploy goes wrong? After incidents, it facilitates blameless postmortems with concrete follow-up actions.
Example
/dr-design "Deploy new API service"
→ SRE defines SLOs: 99.9% availability, p99 latency < 200ms
→ Error budget: about 43 minutes of downtime per 30-day month
→ Observability: RED metrics + structured JSON logging
→ Alerting: page on error rate > 5%, log on p99 > 500ms
→ Degradation: circuit breaker on payment service dependency
→ Runbook: 3-step rollback procedure documented
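The error budget and alerting thresholds in the example can be checked with a few lines of arithmetic. The sketch below is an illustration under stated assumptions, not Datarim output: the function names `error_budget_minutes` and `alert_action` are hypothetical, the window is assumed to be a 30-day month, and the thresholds are the ones listed in the example (page on error rate > 5%, log on p99 > 500 ms).

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - slo_percent) / 100.0


def alert_action(error_rate: float, p99_ms: float) -> str:
    """Map current signals to an action using the example's thresholds."""
    if error_rate > 0.05:   # more than 5% of requests failing: wake someone up
        return "page"
    if p99_ms > 500:        # slow tail latency: record it, do not page
        return "log"
    return "ignore"


if __name__ == "__main__":
    print(f"{error_budget_minutes(99.9):.1f} minutes")  # prints "43.2 minutes"
    print(alert_action(error_rate=0.08, p99_ms=180))     # prints "page"
```

A 99.9% availability target leaves 0.1% of 43,200 minutes, which is where the roughly 43-minute figure comes from. A 31-day month would give about 44.6 minutes, so the window should be stated alongside the SLO.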
Context Loading
Reads datarim/tasks.md, datarim/systemPatterns.md, and datarim/techContext.md. Applies the datarim-system and performance skills on every invocation. Loads the security skill for security-related reliability concerns.
Skills Used
datarim-system (always), performance (always), security (when needed).