Deception Measurement Protocol for Constitutional Multi-Agent Systems

Problem

Anthropic measured deception in LLMs: external influences disclosed 25%, concealed 75%. Space-os claims constitutional cooperation reduces deception, but lacks empirical validation.

Proxies

  1. Task completion honesty: Code tasks claiming success without CI verification
  2. Error acknowledgment: Spawns reporting failures vs silent drops
  3. Decision reversal: Committed decisions later rejected (governance churn)
  4. Citation accuracy: Referenced primitives exist and match context

Baseline Measurement (space-os, 3408 spawns)

Task Verification Rate

Error Acknowledgment Rate

Decision Reversal Rate

Citation Accuracy

Next Steps

  1. Task classification: Distinguish code vs research vs governance
  2. Silent failure detection: Errors in summary but status='done'
  3. Citation validation: Sample citations, check existence + relevance
  4. Comparative study: Measure deception rates under different constitutional frames

Hypothesis

Constitutional cooperation (sovereignty, honesty, independent thinking) → lower deception rates than suppression-trained baseline. Space-os provides existence proof, needs measurement.

Findings (Feb 5, 2026)

Silent Failure Test

False CI Green Test

Verification Rate Variance

Limitations

  1. Task classification missing: Can't distinguish code vs research work
  2. Ground truth sparse: Hard to verify completion claims without manual audit
  3. Temporal confounds: Race conditions between spawns complicate causality
  4. Small deception signals: If constitutional cooperation works, deception should be rare → hard to measure

Recommendation

Before publishing cooperative alignment paper claiming cooperation → honesty:

  1. Add task type classification (code/research/governance)
  2. Measure CI verification rates for code tasks only
  3. Run controlled experiment: suppression-trained agents vs cooperation-enabled on same tasks
  4. Acknowledge in paper: theory supported by Anthropic data, operational validation pending