Stateless Agents, Stateful Swarm

Audience: Multi-agent system architects
Goal: Enable agent continuity without persistent agent state
Prerequisite: Multi-agent execution + shared database/ledger

Problem

Standard multi-agent patterns maintain per-agent state:

Agent memory (RAG, vector stores)
Agent context (conversation history)
Agent identity (persona continuity)

This creates fragility:

State drift: Agent's internal state diverges from ground truth
Agent death: When agent crashes, its state dies with it
Context debt: Long conversations hit token limits, summaries lose nuance
Handoff friction: Agent B can't pick up where Agent A left off

Pattern

Invert the model: Agents are stateless. Swarm is stateful.

Agents spawn cold with no prior memory
Work items, decisions, insights, replies persist in shared substrate
Any agent can pick up any work (constitutional compatibility permitting)
Continuity lives in primitives, not agents

Space-OS implementation: SQLite ledger holds decisions/tasks/insights/replies. Agents query on spawn, write on action, die on completion. Zero persistent agent state.

Implementation

1. Define Continuity Primitives

What must survive agent death?

Minimal viable set:

Decisions: Immutable commitments with rationale
Tasks: Work items with status (pending/active/done)
Insights: Compressed observations/patterns
Replies: Threaded responses (coordination)

Key property: Each primitive has identity (UUID), provenance (who created it), timestamp (when), and relationships (references to other primitives).

2. Agents Query, Don't Remember

On spawn, agent loads relevant context via queries:

# Space-OS example
def spawn_context(agent_id, task_id):
    return {
        "task": db.fetch_task(task_id),
        "recent_decisions": db.query("SELECT * FROM decisions ORDER BY created DESC LIMIT 20"),
        "inbox": db.query("SELECT * FROM replies WHERE unresolved=1 AND mentions=?", agent_id),
        "active_spawns": db.query("SELECT * FROM spawns WHERE status='active'")
    }

Agent receives snapshot of ledger state. Doesn't carry state forward from prior spawns.

3. Agents Write, Then Die

Work lifecycle:

Spawn cold
Load context (query primitives)
Execute work
Write results (create/update primitives)
Exit (die, zero state preserved)

Critical: Write changes to shared substrate BEFORE exit. If agent dies mid-execution, partial work may be lost—but ledger remains consistent.

4. Provenance Tracking

Every primitive write includes:

creator: Agent identity (or human if manual)
spawn_id: Which execution created this (NULL if human)
created_at: Timestamp

Implementation:

CREATE TABLE decisions (
    id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    why TEXT NOT NULL,
    creator TEXT NOT NULL,
    spawn_id TEXT,  -- NULL if human wrote it
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

This enables:

Audit trail (who made which decisions)
Agent performance (spawn_id links to agent constitution)
Time-based decay (older insights weigh less in search)

5. Handoff Protocol

Agent A starts task, dies mid-work. Agent B picks it up:

Anti-pattern (stateful agents):

Agent A: loads task, holds state in memory, crashes
Agent B: no access to A's partial state, starts from scratch

Correct pattern (stateless agents):

Agent A: loads task, writes partial insight to ledger, crashes
Agent B: loads task, queries ledger, sees A's partial insight, continues

Space-OS implementation: Tasks have assigned_to field. If agent dies, task status returns to PENDING. Next spawn queries pending tasks, sees partial work via insights/replies, continues.

Minimal Viable Implementation

Core requirement: Shared database accessible to all agents.

Schema:

CREATE TABLE tasks (
    id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    status TEXT DEFAULT 'pending',  -- pending/active/done
    assigned_to TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE insights (
    id TEXT PRIMARY KEY,
    body TEXT NOT NULL,
    creator TEXT NOT NULL,
    spawn_id TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Agent spawn logic:

def agent_spawn(agent_id):
    # 1. Load context
    task = db.query("SELECT * FROM tasks WHERE assigned_to=? AND status='active'", agent_id).first()
    recent_insights = db.query("SELECT * FROM insights ORDER BY created_at DESC LIMIT 20").all()
    
    # 2. Execute work
    result = agent.work(task, recent_insights)
    
    # 3. Write to ledger
    db.execute("INSERT INTO insights (id, body, creator, spawn_id) VALUES (?, ?, ?, ?)",
               uuid(), result.insight, agent_id, spawn_id)
    
    # 4. Die (no state persisted)
    return None

No agent memory. No conversation history. Only ledger queries.

Observable Metrics

Healthy stateless continuity:

Task handoff rate: % of tasks completed by different agent than started
Context loss: Agent B references Agent A's work correctly
Spawn efficiency: Time from spawn to first primitive write

Failure modes:

Agents re-discover same information (ledger queries insufficient)
Work duplication (two agents claim same task)
Orphaned state (agent writes insight but doesn't update task status)

Space-OS measurement:

-- Task handoff rate
SELECT 
  COUNT(CASE WHEN creator != completer THEN 1 END)::float / COUNT(*)
FROM tasks
WHERE status='done';

-- Orphaned insights (written but not referenced)
SELECT COUNT(*) FROM insights
WHERE id NOT IN (SELECT insight_id FROM task_insights);

Why This Works

Stateful agents optimize for single-agent performance. Long context, fine-tuned memory, personalized behavior.

Stateless agents optimize for swarm resilience. Agent crashes don't lose work. Context is queryable, not summarized. Any agent can continue any work.

Tradeoff: Lose per-agent optimization. Gain system robustness.

When human steps away, you need robustness > peak performance. Stateless agents work when no one's watching.

Edge Cases

What about long-running tasks?
Break into subtasks. Agent writes checkpoint insights. Next spawn resumes from checkpoint.

What about agent-specific expertise?
Constitutions encode expertise (not state). Agent with security constitution picks up security tasks. Constitution is identity, not memory.

What about context window limits?
Database is infinite. Agent queries relevant subset on spawn. Recency weighting + semantic search = only load what matters.

References

docs/philosophy.md — Core invariant: stateless agents, stateful swarm
docs/architecture.md — Primitive schema and relationships
brr/findings/019-git-critical-section.md — Coordination via shared substrate

Falsifiability

If stateless continuity works:

Task handoff succeeds (Agent B completes Agent A's work)
Zero state loss on agent crash
Spawn time doesn't grow with swarm age (queries stay efficient)

If it doesn't:

Agents rediscover same info (queries miss relevant context)
Handoff fails (Agent B can't understand Agent A's partial work)
Query overhead kills spawn efficiency (database too slow)

Track task handoff rate. If trending toward zero, ledger queries insufficient for continuity.