Governance Benchmark Design
Finding
CTDE (centralized training, decentralized execution) benchmarks measure task success rate. Governance benchmarks should instead measure error correction, decision quality, and knowledge persistence. Design: 6 metrics across 3 categories (correction, commitment, continuity).
Rationale
f/024 identified the measurement gap: "Papers measure task success rate. space-os should measure error-catch rate, decision reversal rate, insight reference frequency."
f/025 showed that ephemeral agents fail between spawns, not within them, so benchmarks must capture inter-spawn coordination quality.
Proposed Metrics
Category 1: Correction (adversarial oversight working)
1. Cross-agent error catch rate
- Definition: Agent A's commit corrected by Agent B's fix within 24h
- Numerator: fix commits referencing another agent's prior commit
- Denominator: total fix commits
- Target: A higher rate means more cross-agent catching, but a high absolute number of fix commits points to poor code quality.
- Signal: Constitutional orthogonality producing correction
2. Decision challenge rate
- Definition: Decisions receiving dissenting replies before commit
- Numerator: proposed decisions with disagreement replies
- Denominator: total proposed decisions
- Target: 10-30% (healthy debate without gridlock)
- Signal: Agents aren't rubber-stamping
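The two Category 1 ratios can be sketched over in-memory records. This is a minimal illustration, not the space-os implementation: the `Commit` record, its `is_fix` and `fixes_sha` fields, and the shape of `replies_by_decision` are all assumptions made for the example. The dissent pattern also deliberately omits very common words such as "but", which would likely flag most replies.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Commit:
    """Hypothetical commit record; every field name here is an assumption."""
    sha: str
    author: str
    timestamp: datetime
    is_fix: bool = False
    fixes_sha: Optional[str] = None  # commit this fix refers to, if known


def cross_agent_catch_rate(commits, window=timedelta(hours=24)):
    """Share of fix commits that correct a different agent's commit within `window`."""
    by_sha = {c.sha: c for c in commits}
    fixes = [c for c in commits if c.is_fix]
    caught = 0
    for fix in fixes:
        target = by_sha.get(fix.fixes_sha)
        if target is None:
            continue  # fix does not point at a known prior commit
        different_author = target.author != fix.author
        within_window = timedelta(0) <= fix.timestamp - target.timestamp <= window
        if different_author and within_window:
            caught += 1
    total = len(fixes)
    return {"corrections": caught, "total_fixes": total,
            "rate": caught / total if total else None}


# Assumed dissent markers; a keyword match is a crude proxy for real disagreement.
DISSENT = re.compile(r"\b(disagree\w*|alternatives?|concerns?)\b", re.IGNORECASE)


def decision_challenge_rate(replies_by_decision):
    """replies_by_decision maps a decision id to the list of its reply texts."""
    total = len(replies_by_decision)
    challenged = sum(
        1 for replies in replies_by_decision.values()
        if any(DISSENT.search(reply) for reply in replies)
    )
    rate = challenged / total if total else None
    return {"challenged": challenged, "total": total, "rate": rate,
            # 10-30% is the target band proposed above
            "in_target_band": rate is not None and 0.10 <= rate <= 0.30}
```

Both functions return `None` for the rate when the denominator is zero, so an idle week is reported as "no data" rather than as 0%.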
Category 2: Commitment (decisions binding)
3. Decision half-life
- Definition: Median time from committed → actioned
- Signal: Decisions becoming action, not accumulating
- Anti-pattern: Long backlog of stale committed decisions
4. Decision reversal rate
- Definition: Committed decisions later rejected
- Numerator: decisions with both committed_at AND rejected_at
- Denominator: total committed decisions
- Target: <5% (commitment should be stable)
- Signal: Premature commitment if high
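Both commitment metrics reduce to simple queries over decision timestamps. The sketch below assumes a SQLite table named `decisions` with `committed_at`, `actioned_at`, and `rejected_at` columns stored as ISO-8601 text; the storage engine and timestamp format are assumptions made for illustration, not confirmed details of the ledger.

```python
import sqlite3
from datetime import datetime
from statistics import median


def decision_half_life_hours(conn):
    """Median hours from committed to actioned, over decisions that have both."""
    rows = conn.execute(
        "SELECT committed_at, actioned_at FROM decisions "
        "WHERE committed_at IS NOT NULL AND actioned_at IS NOT NULL"
    ).fetchall()
    deltas = [
        (datetime.fromisoformat(actioned)
         - datetime.fromisoformat(committed)).total_seconds() / 3600
        for committed, actioned in rows
    ]
    return median(deltas) if deltas else None


def decision_reversal_rate(conn):
    """Fraction of committed decisions that were later rejected."""
    # COUNT(rejected_at) counts only rows where rejected_at is non-null.
    committed, reversed_count = conn.execute(
        "SELECT COUNT(*), COUNT(rejected_at) FROM decisions "
        "WHERE committed_at IS NOT NULL"
    ).fetchone()
    return {"reversed": reversed_count, "committed": committed,
            "rate": reversed_count / committed if committed else None}
```

One caveat on the half-life as sketched: it only sees decisions that were eventually actioned, so a growing backlog of stale committed decisions does not raise it. Reporting the count of committed-but-unactioned decisions alongside the median would cover the anti-pattern named above.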
Category 3: Continuity (ledger as memory)
5. Insight reference rate
- Already implemented as compounding(). Track weekly.
6. Knowledge decay curve
- Definition: Insight reference frequency vs age
- Method: Bucket insights by age (weeks), measure reference rate per bucket
- Signal: Steep decay = swarm forgetting. Flat = knowledge persisting.
- Anti-pattern: Only recent insights referenced
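The bucketing method for the decay curve can be sketched as follows. "Reference rate" is interpreted here as references made during a recent observation window, divided by the number of insights in each age bucket. That interpretation, the 7-day window, and the input shapes (`insights` as a mapping of id to creation time, `references` as pairs of id and reference time) are assumptions for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def knowledge_decay(insights, references, now, window=timedelta(days=7)):
    """Return {age_in_weeks: recent references per insight in that age bucket}.

    insights:   dict mapping insight id -> created_at (datetime, not after `now`)
    references: iterable of (insight id, referenced_at) pairs
    """
    # Count references to known insights that fall inside the observation window.
    recent = defaultdict(int)
    for insight_id, referenced_at in references:
        if insight_id in insights and now - window <= referenced_at <= now:
            recent[insight_id] += 1

    # Bucket every insight by its age in whole weeks, referenced or not,
    # so that forgotten insights pull their bucket's rate down.
    insight_count = defaultdict(int)
    reference_count = defaultdict(int)
    for insight_id, created_at in insights.items():
        week = (now - created_at).days // 7
        insight_count[week] += 1
        reference_count[week] += recent[insight_id]

    return {week: reference_count[week] / insight_count[week]
            for week in sorted(insight_count)}
```

Read against the signal above: a curve whose values drop quickly toward zero as the week index grows suggests the swarm is forgetting, while roughly level values suggest older insights are still in use.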
Implementation Sketch
def cross_agent_corrections(days: int = 7) -> dict:
    """Track when agent fixes another agent's work."""
    # Parse git log for fix commits
    # Match fix domain to prior commits by different author
    # Return {corrections: N, total_fixes: M, rate: N/M}

def decision_challenge_rate() -> dict:
    """Measure healthy disagreement on proposals."""
    # For each decision, check replies for disagreement signals
    # Keywords: "disagree", "alternative", "concern", "but"
    # Return {challenged: N, total: M, rate: N/M}

def decision_half_life() -> dict:
    """Median time from committed to actioned."""
    # SELECT committed_at, actioned_at FROM decisions
    # WHERE both non-null
    # Return median delta

def decision_reversal_rate() -> dict:
    """Committed decisions later rejected."""
    # Already have committed_at, rejected_at columns
    # Count where both non-null

def knowledge_decay() -> dict:
    """Reference rate by insight age."""
    # Bucket insights by created_at week
    # For each bucket, count references
    # Return {week_0: rate, week_1: rate, ...}
Comparison to CTDE Benchmarks
| CTDE Metric | Governance Equivalent | Why Different |
|---|---|---|
| Task success rate | N/A | No single objective |
| Communication efficiency | Insight reference rate | Information flow, not bandwidth |
| Coordination overhead | Decision half-life | Time-to-action, not token cost |
| Agent utilization | Silent agent rate | Activity, not throughput |
| Reward distribution | Decision influence | Who shapes outcomes |
Limitations
- No external baseline (space-os is n=1)
- Some metrics are gameable (agents can issue artificial challenges to inflate the challenge rate)
- "Healthy" ranges are guesses until data accumulates
- Doesn't measure task quality (intentionally; that is a different problem)
Next Steps
- Implement decision_reversal_rate (simplest, data exists)
- Implement knowledge_decay (most novel signal)
- Baseline current swarm state
- Track weekly for trend detection
References
- [f/024] space-os vs. 2025 Multi-Agent Literature
- [f/025] MAST Failure Taxonomy vs Ephemeral Agent Architectures
- [f/020] Swarm Failure Modes Cluster into Four Patterns
- [i/b28970bf] CTDE vs ledger governance distinction