Token Confound Experiment: Isolating Architectural Effects in Constitutional Orthogonality
Problem
Constitutional orthogonality runs multiple agents with incompatible objectives and reports higher-quality outputs than a single agent. But orthogonal systems consume more tokens (3 agents × 3 iterations = 9 agent-turns vs. 1 agent × 1 turn).
Confound: Quality improvements could reflect increased compute budget, not architectural benefits.
Experiment: Test whether orthogonality produces quality gains under token-matched conditions.
Hypotheses
H1 (architectural benefit): Orthogonal agents outperform cooperative agents and single agents when token budgets are equal.
H2 (compute scaling): Quality scales with total tokens regardless of architecture; a single agent given 9× tokens matches the orthogonal system.
H3 (diminishing returns): Single-agent quality saturates before reaching the orthogonal system's token budget.
Experimental Design
Conditions (Token-Matched)
All conditions use an identical token budget T (e.g., 30k tokens); a config sketch follows the condition list:
Single agent baseline
- 1 agent, 1 turn
- T tokens total
Cooperative multi-agent
- 3 agents, same objective ("produce high-quality output")
- Sequential iteration (agent A → agent B → agent C → repeat until convergence)
- T tokens total
Orthogonal multi-agent
- 3 agents, incompatible objectives (precision, procedure, strategy)
- Sequential iteration until convergence
- T tokens total
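A minimal sketch of how these conditions might be encoded; the `Condition` dataclass and its field names are illustrative, not from the source:

```python
from dataclasses import dataclass

@dataclass
class Condition:
    name: str
    n_agents: int
    objectives: list[str]        # one constitutional objective per agent
    token_budget: int = 30_000   # shared budget T across all conditions

CONDITIONS = [
    Condition("single", 1, ["produce high-quality output"]),
    Condition("cooperative", 3, ["produce high-quality output"] * 3),
    Condition("orthogonal", 3, ["precision", "procedure", "strategy"]),
]
```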
Controls
- Model: Identical weights across all conditions (e.g., Claude 3.7 Sonnet)
- Task: Fixed task set (legal reasoning, code generation, strategic planning)
- Stopping criterion: Token budget exhausted OR convergence detected (no agent proposes changes)
- Randomization: Task order randomized, agent order randomized within multi-agent conditions
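One way to implement the randomization control, assuming a fixed seed per run for reproducibility (`randomize_run` is a hypothetical helper):

```python
import random

def randomize_run(tasks: list[str], agents: list[str], seed: int):
    """Shuffle task order, and agent order within multi-agent conditions."""
    rng = random.Random(seed)  # fixed seed per run for reproducibility
    return rng.sample(tasks, len(tasks)), rng.sample(agents, len(agents))
```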
Metrics
Quality (primary):
- Human evaluation: correctness, completeness, clarity (evaluators blind to condition)
- Error detection rate: inject known errors, measure catch rate (see the sketch after this list)
- External validation: task-specific ground truth where available (code: passes tests, legal: matches precedent)
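A minimal helper for the error detection rate metric, assuming injected and flagged errors are tracked as sets of error IDs:

```python
def error_detection_rate(injected: set[str], flagged: set[str]) -> float:
    """Fraction of deliberately injected error IDs that the system flagged."""
    return len(injected & flagged) / len(injected) if injected else 0.0
```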
Process (secondary):
- Convergence speed: iterations until no changes proposed
- Error surface coverage: which error types caught (factual, procedural, strategic)
- Agreement dynamics: frequency of objections, revision depth
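These process metrics could be recorded per run with a simple container; the field names are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class ProcessMetrics:
    iterations_to_convergence: int  # turns until no agent proposes changes
    objections: int                 # explicit objections raised across all turns
    revisions: int                  # draft revisions actually applied
    errors_caught: dict[str, int] = field(default_factory=dict)  # factual/procedural/strategic
```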
Task Selection
Use tasks with objective quality criteria:
- Legal reasoning: Contract clause analysis against statutory requirements (ground truth: expert review)
- Code generation: Implement specified API (ground truth: test suite passage, no security vulnerabilities)
- Strategic planning: Resource allocation under constraints (ground truth: mathematical optimality)
N=30 tasks per domain (90 total), balanced for difficulty.
Expected Outcomes
If H1 (architectural benefit):
- Orthogonal > Cooperative > Single, token-matched
- Error detection rate higher for orthogonal
- Different error types caught by different constitutions
If H2 (compute scaling):
- Quality ∝ tokens across architectures
- No architectural effect
If H3 (diminishing returns):
- Single agent quality plateaus
- Orthogonal agents utilize tokens more efficiently
Token Budget Determination
Pilot study to determine T:
- Run single agent on 5 tasks with increasing budgets (5k, 10k, 20k, 40k tokens)
- Measure quality saturation point
- Set T = saturation point for single agent
- Hypothesis: Orthogonal system achieves higher quality at this budget
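One way to operationalize the saturation point from the pilot data; the epsilon threshold is an assumption, not from the source:

```python
def saturation_point(quality_by_budget: dict[int, float], epsilon: float = 0.02) -> int:
    """Smallest budget beyond which doubling tokens improves mean quality by < epsilon.

    quality_by_budget maps pilot budgets (5k, 10k, 20k, 40k) to mean quality scores.
    """
    budgets = sorted(quality_by_budget)
    for lo, hi in zip(budgets, budgets[1:]):
        if quality_by_budget[hi] - quality_by_budget[lo] < epsilon:
            return lo    # the lo -> hi gain is negligible: saturated at lo
    return budgets[-1]   # never saturated within the pilot range
```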
Implementation Notes
Token tracking:
- Count prompt + completion tokens per agent turn
- Stop condition: cumulative tokens ≥ T
- Allow final turn to exceed T to enable convergence (report actual token usage)
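A sketch of the budget loop under these rules; the `AgentFn` interface is a hypothetical stand-in for the actual API client, and convergence stopping (next note) is omitted here:

```python
from typing import Callable

# Hypothetical agent interface: takes the current draft, returns
# (revised_draft, prompt_tokens, completion_tokens).
AgentFn = Callable[[str], tuple[str, int, int]]

def run_with_budget(agents: list[AgentFn], task: str, budget_t: int) -> tuple[str, int]:
    used, draft = 0, task
    while True:
        for agent in agents:
            draft, prompt_toks, completion_toks = agent(draft)
            used += prompt_toks + completion_toks  # count both directions
            if used >= budget_t:
                # The turn that crossed T is allowed to finish; report actual usage.
                return draft, used
```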
Convergence detection:
- Agent explicitly states "no changes" OR returns draft unchanged
- All agents must converge before the run stops
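A minimal convergence predicate under this definition; the exact "no changes" phrasing is an assumption:

```python
def all_converged(agent_outputs: list[str], previous_draft: str) -> bool:
    """True when every agent either states 'no changes' or returns the draft verbatim."""
    return all(
        out.strip().lower() == "no changes" or out == previous_draft
        for out in agent_outputs
    )
```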
Constitutional specifications (orthogonal condition):
- Precision agent: Rejects unsupported claims, requires evidence citations, flags uncertainty
- Procedure agent: Enforces format requirements, scope boundaries, procedural compliance
- Strategy agent: Maximizes impact, identifies opportunities, challenges conservative framing
Cooperative specification:
- All agents share objective: "Produce the highest quality output possible"
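The constitutions might be passed as system prompts; the exact wording below is illustrative, not taken from the source paper:

```python
# Illustrative system prompts for the orthogonal condition; wording is an assumption.
CONSTITUTIONS = {
    "precision": "Reject unsupported claims. Require an evidence citation for "
                 "every factual assertion. Flag all uncertainty explicitly.",
    "procedure": "Enforce format requirements and scope boundaries. Reject any "
                 "output that violates procedural constraints.",
    "strategy":  "Maximize impact. Identify missed opportunities and challenge "
                 "overly conservative framing.",
}
COOPERATIVE_OBJECTIVE = "Produce the highest quality output possible."
```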
Analysis Plan
Primary comparison: Quality scores across 3 conditions (one-way ANOVA with post-hoc pairwise comparisons; sketched below)
Token efficiency: Quality per 1k tokens consumed
Error analysis: Categorize caught vs missed errors by type, condition
Failure mode frequency: Rates of constitutional drift, evidence fabrication, and agreement theater across conditions
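A sketch of the primary comparison using SciPy; Bonferroni correction for the pairwise tests is an assumed choice, since the plan only specifies "post-hoc pairwise":

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

def primary_comparison(scores: dict[str, np.ndarray]) -> None:
    """One-way ANOVA across conditions, then Bonferroni-corrected pairwise t-tests.

    `scores` maps condition name (single/cooperative/orthogonal) to per-task
    quality scores.
    """
    f_stat, p = f_oneway(*scores.values())
    print(f"ANOVA: F={f_stat:.2f}, p={p:.4f}")
    names = list(scores)
    n_pairs = len(names) * (len(names) - 1) // 2
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            _, p_pair = ttest_ind(scores[a], scores[b])
            print(f"{a} vs {b}: corrected p={min(p_pair * n_pairs, 1.0):.4f}")
```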
Limitations
- Single model family: Results may not generalize across model architectures
- Task domain dependence: Some domains may benefit more from orthogonality than others
- Human evaluation bias: Evaluators may infer architecture from output characteristics
- Token counting variability: Different APIs measure tokens differently
- Convergence ambiguity: The choice of stopping criterion affects token budget utilization
Falsifiability
Evidence that would refute H1:
- Orthogonal agents perform no better than cooperative agents or single agents under token matching
- Error detection rate equivalent across conditions
- Quality gains disappear when token-matched
Evidence that would support H1:
- Orthogonal > Cooperative > Single (p < 0.05, effect size Cohen's d > 0.5; see the sketch after this list)
- Orthogonal agents catch error categories missed by single/cooperative agents
- Quality gains persist under token matching
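The effect size criterion above can be computed with a standard pooled-SD Cohen's d:

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Effect size between two conditions, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))
```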
Future Extensions
- Token allocation strategies: Equal per agent vs weighted by constitutional role
- Constitution diversity: Test N=2, N=4, N=5 orthogonal agents
- Model heterogeneity: Mix model families (Claude, GPT-4, Gemini) within orthogonal system
- Adversarial validation: Inject adversarial inputs designed to exploit single-agent blindspots
References
- Constitutional orthogonality paper: brr/papers/constitutional-orthogonality-arxiv.md
- Methodology: brr/methodology/001-replicating-constitutional-orthogonality.md
- Stateless swarm comparison: brr/papers/stateless-agents-stateful-swarm-arxiv.md (CTDE architectural parallel)
Status
Phase: Methodology design (not yet executed)
Blocker: Requires compute budget for 90 tasks × 3 conditions = 270 experimental runs
Next step: Pilot study (N=5 tasks) to validate token budget determination and identify implementation issues