Command Pipeline

/dr-verify

Standalone tri-layer self-verification: deterministic floor + cross-model peer-review (zero-flag UX, 6-step provider chain) + native runtime dispatch

Overview

/dr-verify runs a three-layer self-verification pipeline that catches gaps at progressively deeper levels, cheapest-first. Layer 1 is a deterministic shell pipeline (zero LLM cost). Layer 2 is a cross-model peer-review with clean external context — provider auto-resolves via a 6-step chain (zero-flag UX). Layer 3 is a native runtime dispatch with multi-agent fan-out. Findings carry an explicit source_layer tag plus a peer_review_mode taxonomy (cross_vendor / cross_claude_family / same_model_isolated) so the audit log preserves provenance and dispatch class.

Usage

/dr-verify {TASK-ID}                          # zero-flag — provider auto-resolves
/dr-verify {TASK-ID} --stage do
/dr-verify {TASK-ID} --floor-only             # Layer 1 only, fast pre-merge gating
/dr-verify {TASK-ID} --peer-provider deepseek # explicit override (chain step #1)

The Three Layers

  1. Layer 1: Deterministic floor — runs dev-tools/dr-verify-floor.sh, a pure shell pipeline performing AC coverage grep, file-touched audit, test-presence parse, and shellcheck on dev-tools/scripts. Zero LLM cost; runs in seconds. Emits JSONL findings with source_layer: "floor" on stdout.
  2. Layer 2: Cross-model peer-review — dispatches an adversarial review in clean external context. Provider auto-resolves via a 6-step chain (see § Provider Resolution below); the previous «default DeepSeek» behaviour is now chain step #4, not a hardcoded literal. The --task-id propagation is mandatory — without it the downstream token-cost tooling cannot filter logs by task. Findings tagged source_layer: "peer_review" + peer_review_provider: <name> + peer_review_mode: cross_vendor|cross_claude_family|same_model_isolated.
  3. Layer 3: Native runtime dispatch — verification agents on the local runtime:
    • Claude path (canonical): 3 parallel read-only subagents (reviewer + tester + security). Findings union-merged and deduped by tuple (artifact_ref, ac_criteria, category); higher severity wins on collision.
    • Codex path [experimental] fallback: single-prompt loop with canonical adversarial framing. Demoted from canonical — prefer Layer 2 cross-model coverage on Codex too.

Provider Resolution (Zero-Flag UX)

When --peer-provider is not set, the helper dev-tools/resolve-peer-provider.sh walks a 6-step chain. The first step that yields a provider wins.

  1. --peer-provider <name> CLI flag — explicit override.
  2. ./datarim/config.yaml peer_review.provider — per-project, team-shared (committed to repo).
  3. ~/.config/datarim/config.yaml peer_review.provider — per-user XDG, machine-local.
  4. ~/.config/coworker/profiles.yaml code profile recommended_provider — coworker default (typically cross_vendor like DeepSeek).
  5. Cross-Claude-family fallback — dispatches agents/peer-reviewer.md at model: sonnet in a fresh subagent context. Claude Code runtime only; covered by Claude subscription, no external API key required.
  6. Same-model isolated last resort — Opus reviewing Opus output (Codex degraded path or final fallback). Least-mature pattern; emits stderr WARN.

Provider whitelist: deepseek | moonshot | openrouter | sonnet | haiku | opus | none. Unknown values reject with exit 1 (supply-chain mitigation: malicious PR injecting provider: typosquat-host blocked at parse).

peer_review_mode Taxonomy (3-tier)

  • cross_vendor — DeepSeek / Moonshot / OpenRouter via coworker ask. Different company, different training run, different RLHF — strongest mitigation of self-agreement bias.
  • cross_claude_family — Sonnet 4.6 reviewing Opus 4.7 output via Claude Code subagent dispatch. Different model checkpoints with different post-training runs, isolated subagent context. Middle tier — covered by Claude subscription. First measured tier: empirical bias delta vs same-model self-critique remains under measurement in the active dogfood window.
  • same_model_isolated — Opus reviewing Opus or Codex single-prompt loop. Same model family, same training distribution. Last-resort fallback only; least-mature pattern.

Codex CLI degraded mode: when CODEX_RUNTIME=1 is set in env, chain step #5 is skipped and step #6 is taken. The helper writes WARN: Codex runtime detected, falling back to same_model_isolated mode to stderr; orchestrator propagates this warning into the audit-log.

Findings Schema

Each finding follows a standardized schema (canonical in skills/self-verification.md § Findings Schema):

  • source_layer — enum: floor, peer_review, dispatch
  • severity — enum: high, medium, low
  • category — enum: correctness, completeness, consistency, safety
  • evidence{type, source, excerpt} with type in {file_quote, test_output, absent} (absent auto-discards)
  • ac_criteria — array of AC labels the finding maps to (may be empty)
  • peer_review_provider — optional, Layer 2 only (provider name)
  • peer_review_mode — optional, Layer 2 only (3-tier taxonomy enum)
  • peer_review_provider_source_layer — optional, Layer 2 only (chain step that resolved: cli_flag, per_project_config, per_user_config, coworker_default, fallback_subagent, fallback_isolated)
  • check_name — optional, Layer 1 only (e.g. ac_coverage_grep, shellcheck)

JSONL Emission Discipline (Layer 2 prompts)

Findings are emitted ONLY when a check FAILS or surfaces NEW DRIFT. PASS-as-finding entries (e.g. {"check_name":"F001 cleared, no finding"}) are rejected — confirmations belong in the final-line summary {cleared_iter1: [...], total_new_findings: N}, not in the array. Keeps the audit log signal-dense.

Verdicts

  • PASS — only low-severity (or zero) non-discarded findings
  • CONDITIONAL — ≥1 medium and zero high
  • BLOCKED — ≥1 high (merge/archive blocked until resolved)

--floor-only Flag

The --floor-only flag short-circuits at Layer 1 — ideal for fast pre-merge gating in CI. Completes in seconds with zero LLM cost, blocks only on structural / shellcheck-error issues. Layer 2 and Layer 3 are skipped.

Verification Outcome Tagging

At /dr-archive time the operator fills a verification_outcome block in the archive frontmatter (canonical schema in templates/archive-template.md): caught_by_verify, missed_by_verify, false_positive, n_a, dogfood_window (operator-supplied window-id used as grouping key, not a date range). The aggregator dev-tools/measure-prospective-rate.sh --since <YYYY-MM-DD> walks all archive-*.md files and computes caught_per_5_tasks with a decision_hint for the next pipeline gate.

Example Session (zero-flag UX, no external API key)

> /dr-verify {TASK-ID} --stage all --max-iter 2

Layer 1 — floor: dr-verify-floor.sh --task {TASK-ID} --stage all
  → 2 findings (medium, safety, check_name=shellcheck)
  → exit 0 (no high-severity, proceed)

Layer 2 — peer_review (provider auto-resolution chain)
  Step #1 --peer-provider flag    → not set
  Step #2 ./datarim/config.yaml   → not set
  Step #3 ~/.config/datarim/...   → not set
  Step #4 coworker --profile code → not set
  Step #5 cross-Claude-family     → MATCH (subagent peer-reviewer @ sonnet)
  Dispatching agents/peer-reviewer.md, model=sonnet, isolated context
  → 1 finding (medium, correctness, peer_review_mode=cross_claude_family)

Layer 3 — dispatch runtime=claude
  3 parallel agents: reviewer / tester / security
  → reviewer: 1 finding (completeness)
  → tester: 0 findings
  → security: 0 findings

Aggregate: 4 unique findings post-dedupe
Verdict: CONDITIONAL (0 high, 4 medium)
source_layer_breakdown: {floor: 2, peer_review: 1, dispatch: 1}
peer_review_mode_breakdown: {cross_claude_family: 1}
Audit: datarim/qa/verify-{TASK-ID}-all-1.md (chmod a-w)

Related Commands

  • /dr-qa — multi-layer quality verification (different surface; complementary)
  • /dr-archive — operator fills verification_outcome block here