Use Case

SRE & Reliability Engineering

Observability stacks, SLO frameworks, alerting, and incident response runbooks.

Overview

Reliability work spans observability, SLOs, and incident response, and it demands rigorous planning and design review from multiple perspectives. Datarim's pipeline ensures that SLO targets are defined before instrumentation begins, that tool choices are evaluated by Consilium panels, and that every service is verified for metrics, logs, and traces before the work is considered complete. The compliance stage adds an infrastructure checklist covering monitoring, alert thresholds, a rollback plan, and least-privilege access.

Example: Observability Stack and SLO Framework for Microservices

An SRE team needs to design and implement an observability stack for a platform with 8 microservices. The work includes metrics instrumentation, centralized logging, distributed tracing, alerting rules, and SLO dashboards.

Pipeline Walkthrough

Stage	What happens
/dr-init	Scope: metrics, logging, tracing, alerting for 8 services. SLO definitions. Complexity: L4
/dr-prd	Requirements: SLO targets (99.9% availability, p99 latency < 500 ms), alert channels, on-call rotation, incident runbooks
/dr-plan	Phases: 1) metrics instrumentation, 2) centralized logging, 3) distributed tracing, 4) alerting rules, 5) SLO dashboards
/dr-design	Consilium panel: SRE + Security + DevOps evaluate Prometheus vs Datadog, ELK vs Loki, Jaeger vs Tempo
/dr-do	Implement phase by phase. Each service instrumented independently
/dr-qa	Verify: all services emit metrics, logs searchable, traces connected across services, alerts fire correctly
/dr-compliance	Infrastructure checklist: monitoring configured, alert thresholds set, rollback plan, security (least-privilege)
/dr-archive (Step 0.5)	Lesson: starting with SLO definitions before instrumentation kept the team focused on what matters
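
To make the example SLO targets from the /dr-prd stage concrete, the sketch below turns 99.9% availability into an error budget and checks a batch of latency samples against the 500 ms p99 target. It is a minimal illustration, not part of Datarim: the function names, the 30-day window, and the nearest-rank percentile method are all assumptions chosen for the example.

```python
from typing import Sequence


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO.

    The 30-day default window is an assumption; the source does not name one.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


def p99_ms(latencies_ms: Sequence[float]) -> float:
    """99th percentile of the samples using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # ceil(0.99 * n), computed with integers to avoid float rounding surprises
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_slo(latencies_ms: Sequence[float], threshold_ms: float = 500.0) -> bool:
    """True when the p99 latency is strictly below the threshold."""
    return p99_ms(latencies_ms) < threshold_ms


# Hypothetical usage: a 99.9% target over 30 days, and 100 synthetic samples.
budget = error_budget_minutes(0.999)
samples = [120.0] * 99 + [900.0]
print(f"error budget: {budget:.1f} minutes, latency SLO met: {meets_latency_slo(samples)}")
```

Over a 30-day window, 99.9% availability leaves roughly 43.2 minutes of budget, which is the kind of number that makes an abstract target actionable when alert thresholds are set later in the pipeline.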

Key Benefits

SLO-driven design — defining availability and latency targets first ensures instrumentation serves business goals, not vanity metrics
Multi-perspective tool evaluation — Consilium panels bring SRE, Security, and DevOps perspectives together when choosing between Prometheus, Datadog, or other stacks
Per-service verification — QA checks each service independently for metrics emission, log searchability, and trace connectivity
Infrastructure hardening — the compliance stage verifies alert thresholds, on-call configuration, and least-privilege access controls
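
The per-service verification described above can be pictured as a simple completeness check over the three signal types. The following is a hedged sketch only: the `missing_signals` helper, the input shape, and the service names are hypothetical, and in practice the observed signals would come from whichever metrics, logging, and tracing backends the team selected.

```python
from typing import Dict, Iterable, List

# The three signals every service must emit before QA passes (from the text above).
REQUIRED_SIGNALS = frozenset({"metrics", "logs", "traces"})


def missing_signals(observed: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each service to the required signals it does not yet emit.

    Services that emit all three signals are omitted from the result,
    so an empty dict means every service passed.
    """
    gaps: Dict[str, List[str]] = {}
    for service, signals in observed.items():
        missing = sorted(REQUIRED_SIGNALS - set(signals))
        if missing:
            gaps[service] = missing
    return gaps


# Hypothetical inventory for two of the eight services.
observed = {
    "checkout": ["metrics", "logs", "traces"],
    "inventory": ["metrics"],
}
for service, missing in missing_signals(observed).items():
    print(f"{service}: missing {', '.join(missing)}")
```

Checking each service on its own, rather than the stack as a whole, keeps a single uninstrumented service from hiding behind an otherwise healthy dashboard.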

Relevant Agents

Which agents are most active in this use case:

SRE — SLO framework, observability design, incident response planning
DevOps — infrastructure provisioning and monitoring configuration
Security — access controls, log integrity, and audit trail verification
Architect — system-wide observability architecture and tool selection
Compliance — infrastructure checklist and alert configuration verification

Complexity Routing

How complexity levels apply to SRE and reliability engineering:

L1 — Adjust an alert threshold or add a new metric to an existing dashboard
L2 — Instrument a single service with metrics and structured logging
L3 — Set up alerting rules with escalation paths and on-call rotation for a service group
L4 — Design and implement a complete observability stack with SLO framework across 8+ microservices
