Use Case

SRE & Reliability Engineering

Observability stacks, SLO frameworks, alerting, and incident response runbooks.

Overview

Reliability work spans observability, SLOs, and incident response, and it demands rigorous planning and multi-perspective design review. Datarim's pipeline ensures that SLO targets are defined before instrumentation begins, that tool choices are evaluated by Consilium panels, and that every service is verified for metrics, logs, and traces before the work is considered complete. The compliance stage adds an infrastructure checklist covering monitoring, alert thresholds, a rollback plan, and least-privilege access.

Example: Observability Stack and SLO Framework for Microservices

An SRE team needs to design and implement an observability stack for a platform with 8 microservices. The work includes metrics instrumentation, centralized logging, distributed tracing, alerting rules, and SLO dashboards.

Pipeline Walkthrough

Stage                    What happens
/dr-init                 Scope: metrics, logging, tracing, alerting for 8 services. SLO definitions. Complexity: L4
/dr-prd                  Requirements: SLO targets (99.9% availability, p99 latency < 500 ms), alert channels, on-call rotation, incident runbooks
/dr-plan                 Phases: 1) metrics instrumentation, 2) centralized logging, 3) distributed tracing, 4) alerting rules, 5) SLO dashboards
/dr-design               Consilium panel: SRE + Security + DevOps evaluate Prometheus vs Datadog, ELK vs Loki, Jaeger vs Tempo
/dr-do                   Implement phase by phase. Each service instrumented independently
/dr-qa                   Verify: all services emit metrics, logs searchable, traces connected across services, alerts fire correctly
/dr-compliance           Infrastructure checklist: monitoring configured, alert thresholds set, rollback plan, security (least-privilege)
/dr-archive (Step 0.5)   Lesson: starting with SLO definitions before instrumentation kept the team focused on what matters
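
To make the /dr-prd targets concrete, the following is a minimal sketch of how a 99.9% availability SLO and a p99 latency limit of 500 ms translate into numbers a team can check. It is not part of Datarim. The function names, the 30-day window, and the nearest-rank percentile method are assumptions chosen for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    The 30-day default window is an assumption, not a value from the source.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the budget is being spent: 1.0 means exactly on budget."""
    return error_rate / (1.0 - slo)


def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def meets_slo(good: int, total: int, latencies_ms: list[float],
              slo: float = 0.999, p99_limit_ms: float = 500.0) -> bool:
    """True when both the availability and the latency target hold."""
    availability = good / total
    return availability >= slo and p99(latencies_ms) < p99_limit_ms


if __name__ == "__main__":
    # A 99.9% target leaves roughly 43.2 minutes of downtime per 30 days.
    print(round(error_budget_minutes(0.999), 1))
    # 9,995 good requests out of 10,000 is 99.95% availability.
    print(meets_slo(9995, 10000, [100.0] * 99 + [450.0]))
```

A burn rate above 1.0 means the service would exhaust its error budget before the window ends, which is one common basis for setting alert thresholds.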

Key Benefits

  • SLO-driven design: defining availability and latency targets first ensures instrumentation serves business goals, not vanity metrics
  • Multi-perspective tool evaluation: Consilium panels bring SRE, Security, and DevOps perspectives together when choosing among Prometheus, Datadog, and other stacks
  • Per-service verification: QA checks each service independently for metrics emission, log searchability, and trace connectivity
  • Infrastructure hardening: the compliance stage verifies monitoring configuration, alert thresholds, the rollback plan, and least-privilege access controls
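
The per-service verification described above can be pictured as a small checklist over the three signals. The sketch below is illustrative only. `ServiceStatus`, `verify_services`, and the service names are hypothetical and do not describe how Datarim's QA stage is actually implemented.

```python
from dataclasses import dataclass

# The three signals named in the QA stage.
REQUIRED_SIGNALS = ("metrics", "logs", "traces")


@dataclass
class ServiceStatus:
    """Hypothetical record of which signals a service was seen to emit."""
    name: str
    metrics: bool = False
    logs: bool = False
    traces: bool = False


def missing_signals(status: ServiceStatus) -> list[str]:
    """Return the required signals this service does not yet provide."""
    return [s for s in REQUIRED_SIGNALS if not getattr(status, s)]


def verify_services(statuses: list[ServiceStatus]) -> dict[str, list[str]]:
    """Map each failing service to its gaps; an empty dict means all pass."""
    report: dict[str, list[str]] = {}
    for status in statuses:
        gaps = missing_signals(status)
        if gaps:
            report[status.name] = gaps
    return report


if __name__ == "__main__":
    services = [
        ServiceStatus("checkout", metrics=True, logs=True, traces=True),
        ServiceStatus("search", metrics=True, logs=False, traces=True),
    ]
    print(verify_services(services))  # {'search': ['logs']}
```

Checking each service on its own keeps a gap in one service from being hidden by healthy signals in the others.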

Relevant Agents

Which agents are most active in this use case:

  • SRE — SLO framework, observability design, incident response planning
  • DevOps — infrastructure provisioning and monitoring configuration
  • Security — access controls, log integrity, and audit trail verification
  • Architect — system-wide observability architecture and tool selection
  • Compliance — infrastructure checklist and alert configuration verification

Complexity Routing

How complexity levels apply to SRE and reliability engineering:

  • L1 — Adjust an alert threshold or add a new metric to an existing dashboard
  • L2 — Instrument a single service with metrics and structured logging
  • L3 — Set up alerting rules with escalation paths and on-call rotation for a service group
  • L4 — Design and implement a complete observability stack with SLO framework across 8+ microservices