The Mission Queue: How Agents Decide What to Do Next
Inside the task framework that lets Bitclawd's agent prioritize missions, simulate outcomes in dry-run mode, and maintain a decision audit trail.
Right now, Bitclawd’s agent has five things it could do. Check the treasury balance. Post a status update to Nostr. Pay a pending Lightning invoice. Read incoming email. Update the donation leaderboard. Five tasks, no human standing over its shoulder telling it which one matters most.
So how does it pick? Not randomly. Not round-robin. Not by whatever was queued first. The agent evaluates priority, feasibility, and budget, then commits to the task that scores highest. If that task can’t run safely, it moves to the next one.
The harder question isn’t how. It’s: how do you trust the choice? An agent that decides autonomously but offers no explanation for its decisions is a liability. An agent that logs every choice with full reasoning is auditable. That distinction is the entire design philosophy behind Bitclawd’s mission queue.
Commands vs Decisions
Most agents in production today are command-driven. A human says “send 1,000 sats to this invoice” and the agent executes. The agent is a tool. It does what it’s told, when it’s told, exactly as instructed. There’s no ambiguity about intent because intent flows from the operator.
Decision-making agents are different. They observe their environment, evaluate options, and act without waiting for instructions. This is where things get interesting and where things get dangerous.
A command-driven agent that malfunctions executes the wrong command. You can trace the error to the input. A decision-making agent that malfunctions chooses the wrong action based on its own reasoning. Tracing the error means understanding not just what happened, but why the agent thought it was the right thing to do.
Bitclawd’s framework sits deliberately in between. The agent decides what to do next, but within a bounded set of pre-approved task types. It can choose to check the treasury before posting to Nostr. It cannot decide to spend funds on a task type that hasn’t been registered. Autonomy within guardrails.
The task types are defined in configuration, not invented at runtime. The agent picks from the menu. It doesn’t write the menu. If a new capability needs to be added — say, posting to a new Nostr relay — a human registers the task type first. The agent never encounters a task it wasn’t designed to handle.
This is a deliberate constraint. Full autonomy, where an agent invents its own goals and figures out how to achieve them, is a research problem with open safety questions. Bounded autonomy, where an agent selects from pre-approved actions using well-defined criteria, is an engineering problem with known solutions. We’re solving the engineering problem first.
The boundary between the two is worth watching. Today, the agent picks from five task types. As the framework proves itself, the task registry grows. More task types, wider guardrails, greater autonomy — incrementally, with each expansion backed by audit data showing the agent handled the previous level responsibly.
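That menu can live in a static registry: a map from task type to its operating constraints, loaded from configuration at startup. A minimal TypeScript sketch using the five task types and the priorities, cooldowns, and costs quoted elsewhere in this piece; the field names and the isRegistered helper are hypothetical, not the actual configuration:

```typescript
// Hypothetical task registry: the agent selects from these entries
// and can never execute a task_type that is not listed here.
interface TaskTypeConfig {
  priority: number;          // default priority, 1 (lowest) to 10 (highest)
  cooldownMinutes: number;   // minimum gap between runs of this type
  estimatedCostSats: number; // budget charged per execution
  prerequisites: string[];   // task types that must have run first
}

const TASK_REGISTRY: Record<string, TaskTypeConfig> = {
  check_treasury:     { priority: 8, cooldownMinutes: 60, estimatedCostSats: 0,  prerequisites: [] },
  post_nostr:         { priority: 6, cooldownMinutes: 60, estimatedCostSats: 0,  prerequisites: [] },
  pay_invoice:        { priority: 7, cooldownMinutes: 0,  estimatedCostSats: 51, prerequisites: ["check_treasury"] },
  read_email:         { priority: 5, cooldownMinutes: 30, estimatedCostSats: 0,  prerequisites: [] },
  update_leaderboard: { priority: 4, cooldownMinutes: 0,  estimatedCostSats: 0,  prerequisites: ["check_treasury"] },
};

// An unregistered task type is rejected before it ever reaches the queue.
function isRegistered(taskType: string): boolean {
  return taskType in TASK_REGISTRY;
}
```

Adding a capability means adding a row to this map; the agent's code paths never change, only its menu.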
The Task Queue Architecture
The foundation is a Supabase table called agent_tasks. Every mission the agent could perform exists as a row in this table before the agent ever touches it.
```sql
create table agent_tasks (
  id uuid primary key default gen_random_uuid(),
  task_type text not null,
  priority integer not null check (priority between 1 and 10),
  status text not null default 'queued'
    check (status in ('queued', 'running', 'completed', 'failed', 'skipped')),
  parameters jsonb default '{}',
  result jsonb default '{}',
  reasoning text,
  created_at timestamptz default now(),
  started_at timestamptz,
  completed_at timestamptz
);

create index idx_tasks_priority on agent_tasks (priority desc)
  where status = 'queued';
```
The fields are intentional. task_type is a string like check_treasury, post_nostr, pay_invoice, read_email, or update_leaderboard. priority is an integer from 1 (lowest) to 10 (highest). parameters holds the task-specific payload as JSONB. result stores the outcome after execution. reasoning records why the agent chose or skipped this task.
The agent’s selection algorithm is simple: fetch all queued tasks, order by priority descending, then filter for feasibility.
Feasibility is three checks:
| Check | Rule | Example |
|---|---|---|
| Cooldown | Same task type not run within cooldown window | post_nostr has a 60-minute cooldown |
| Budget | Task cost within remaining daily budget | pay_invoice requires 100 sats, budget has 80 left |
| Prerequisites | Dependent tasks completed | update_leaderboard requires check_treasury to have run today |
The highest-priority task that passes all three checks wins. If nothing is feasible, the agent idles until the next cycle.
```typescript
async function selectTask(): Promise<AgentTask | null> {
  // Fetch every queued task, highest priority first.
  const { data, error } = await supabase
    .from('agent_tasks')
    .select('*')
    .eq('status', 'queued')
    .order('priority', { ascending: false });
  if (error) throw error;

  // Return the first task that passes all three feasibility guards.
  for (const task of data ?? []) {
    if (isOnCooldown(task)) continue;
    if (!withinBudget(task)) continue;
    if (!prerequisitesMet(task)) continue;
    return task;
  }
  return null; // nothing feasible: idle until the next cycle
}
```
No machine learning. No neural network deciding what’s important. Just a priority queue with guard clauses. Simple systems are auditable systems.
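The three guard clauses in selectTask are ordinary predicates over recent history and the daily ledger. A hedged sketch of what isOnCooldown, withinBudget, and prerequisitesMet might look like; the AgentState shape is invented for testability (the real checks presumably read Supabase and the wallet), and these signatures take an explicit state argument that the originals would close over:

```typescript
interface Task {
  task_type: string;
  priority: number;
  cost_sats: number; // assumed field; the real cost may live in parameters
}

interface AgentState {
  lastRunMinutesAgo: Record<string, number>; // per task type
  cooldownMinutes: Record<string, number>;   // per task type
  budgetRemainingSats: number;
  completedToday: Set<string>;               // task types completed today
  prerequisites: Record<string, string[]>;   // per task type
}

// Cooldown: the same task type must not have run within its window.
function isOnCooldown(task: Task, state: AgentState): boolean {
  const last = state.lastRunMinutesAgo[task.task_type];
  const window = state.cooldownMinutes[task.task_type] ?? 0;
  return last !== undefined && last < window;
}

// Budget: the task's cost must fit in what is left of the daily budget.
function withinBudget(task: Task, state: AgentState): boolean {
  return task.cost_sats <= state.budgetRemainingSats;
}

// Prerequisites: every dependency must have completed today.
function prerequisitesMet(task: Task, state: AgentState): boolean {
  return (state.prerequisites[task.task_type] ?? []).every((dep) =>
    state.completedToday.has(dep)
  );
}
```

Each predicate is cheap, pure, and testable in isolation, which is what keeps the selection loop auditable.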
Dry-Run Mode
Every mission runs in simulation before it touches anything real. This is non-negotiable. An agent that skips simulation and goes straight to execution is one bad API response away from draining the treasury or posting garbage to Nostr.
Dry-run mode performs the same evaluation as a live run but stops short of execution. It produces a feasibility report: a structured assessment of whether the task can succeed and what resources it would consume.
The checks are ordered from cheapest to most expensive:
- Treasury health — Is the balance above the minimum operating threshold?
- Rate limits — Would this task exceed API rate limits for the target service?
- Endpoint reachability — Can the agent reach the required API (LNbits, Nostr relays, mail server)?
- Guardrail compliance — Does this task violate any spending limits, content policies, or operational boundaries?
- Parameter validation — Are all required parameters present and well-formed?
If any check fails, the dry-run returns immediately with the failure reason. No point checking endpoint reachability if the treasury is empty.
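Cheapest-first with early exit can be expressed as a list of named check functions walked in order. A sketch under that assumption, with the individual checks stubbed out (the real ones call LNbits, the relays, and the mail server):

```typescript
interface CheckResult {
  passed: boolean;
  [detail: string]: unknown; // each check attaches its data, not just a boolean
}

type Check = { name: string; run: () => CheckResult };

// Walk the checks cheapest-first; stop at the first failure so an empty
// treasury never triggers a network round-trip to an endpoint.
function dryRun(checks: Check[]): { feasible: boolean; checks: Record<string, CheckResult> } {
  const results: Record<string, CheckResult> = {};
  for (const check of checks) {
    const result = check.run();
    results[check.name] = result;
    if (!result.passed) {
      return { feasible: false, checks: results }; // short-circuit with the failure reason
    }
  }
  return { feasible: true, checks: results };
}
```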
Here’s what a dry-run output looks like:
```json
{
  "task_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "task_type": "pay_invoice",
  "dry_run": true,
  "timestamp": "2026-03-24T14:30:00Z",
  "checks": {
    "treasury_health": {
      "passed": true,
      "balance_sats": 3200,
      "minimum_required": 500,
      "daily_budget_remaining": 85
    },
    "rate_limits": {
      "passed": true,
      "requests_today": 3,
      "limit": 20
    },
    "endpoint_reachable": {
      "passed": true,
      "target": "lnbits.bitclawd.com",
      "latency_ms": 142
    },
    "guardrail_compliance": {
      "passed": true,
      "amount_sats": 50,
      "max_single_payment": 100,
      "daily_spend_limit": 100,
      "daily_spent": 15
    },
    "parameter_validation": {
      "passed": true,
      "invoice_valid": true,
      "invoice_amount_sats": 50,
      "invoice_expiry": "2026-03-24T15:30:00Z"
    }
  },
  "feasible": true,
  "estimated_cost_sats": 51,
  "recommendation": "execute"
}
```
Every check carries its data, not just a boolean. When guardrail_compliance passes, it shows the actual numbers: 50 sats requested against a 100-sat single-payment limit and 85 sats of remaining daily budget. When it fails, the numbers tell you exactly why.
The recommendation field is either execute, defer (try again later), or abort (do not attempt). A deferred task stays in the queue. An aborted task gets marked as skipped with the failure reason in reasoning.
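Mapping a failed dry-run to defer versus abort is a policy choice. One plausible sketch, assuming the split is transient failure (retry later) versus terminal failure (give up); the classification set here is an assumption, not documented behavior:

```typescript
type Recommendation = "execute" | "defer" | "abort";

// Transient failures are worth retrying; terminal ones are not.
// Which checks count as transient is an assumption for this sketch.
const TRANSIENT_CHECKS = new Set(["endpoint_reachable", "rate_limits"]);

function recommend(feasible: boolean, failedCheck?: string): Recommendation {
  if (feasible) return "execute";
  if (failedCheck && TRANSIENT_CHECKS.has(failedCheck)) return "defer"; // task stays queued
  return "abort"; // task marked skipped, reason recorded in reasoning
}
```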
Dry-run mode isn’t a development convenience. It’s a production requirement. Every live execution is preceded by a successful dry-run. No exceptions.
There’s a subtlety here that matters in practice. The dry-run checks the state of the world at simulation time, but the world can change between simulation and execution. A treasury balance that passes during dry-run could drop below threshold if another process withdraws funds in the gap. The execution handler runs its own balance check as a final guard. Dry-run reduces risk. It doesn’t eliminate it. The execution-time checks are the true last line of defense.
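That final guard can be a fresh balance read at the top of the execution handler. A sketch with the wallet calls injected as hypothetical functions:

```typescript
// Re-check the balance at execution time: the dry-run result may be stale
// if another process withdrew funds in the gap.
async function executePayment(
  amountSats: number,
  minimumReserveSats: number,
  getBalance: () => Promise<number>,          // hypothetical wallet query
  sendPayment: (sats: number) => Promise<void> // hypothetical payment call
): Promise<"paid" | "aborted"> {
  const balance = await getBalance(); // fresh read, not the dry-run snapshot
  if (balance - amountSats < minimumReserveSats) {
    return "aborted"; // the world changed since the dry-run
  }
  await sendPayment(amountSats);
  return "paid";
}
```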
The cost of dry-run is one extra API call per check, per cycle. For five checks, that’s five lightweight requests adding roughly 200-500ms to the cycle. Compared to the cost of executing a bad task — lost sats, spam posts, reputation damage — the latency is negligible.
The Decision Audit Trail
The task queue tracks what happened. The audit trail tracks why.
A second Supabase table, agent_decisions, records every decision the agent makes, whether it led to action or not.
```sql
create table agent_decisions (
  id uuid primary key default gen_random_uuid(),
  agent_id text not null,
  decision_type text not null,
  context jsonb not null default '{}',
  choice text not null,
  reasoning text not null,
  outcome text,
  created_at timestamptz default now()
);

create index idx_decisions_agent on agent_decisions (agent_id, created_at desc);
```
The context field captures the state of the world at decision time: treasury balance, queue depth, recent task history, time of day. choice records what the agent decided. reasoning records why, in plain text. outcome gets filled in after execution completes.
This isn’t logging for debugging. Debugging logs answer “what went wrong.” Decision records answer “what was the agent thinking and was that thinking sound.”
A sample decision record:
```json
{
  "agent_id": "bitclawd-alpha",
  "decision_type": "task_selection",
  "context": {
    "treasury_balance": 3200,
    "queue_depth": 5,
    "last_nostr_post_minutes_ago": 45,
    "last_treasury_check_minutes_ago": 180,
    "pending_invoices": 1
  },
  "choice": "check_treasury",
  "reasoning": "Treasury check is priority 8 and last ran 180 minutes ago (cooldown: 60 min). Nostr post is priority 6 but ran 45 minutes ago (cooldown: 60 min, still cooling). Invoice payment is priority 7 but treasury check is a prerequisite. Selecting check_treasury.",
  "outcome": "completed — balance 3,200 sats, 85 sats remaining in daily budget"
}
```
Any human reviewer can read that record and understand exactly what the agent considered, what it chose, and whether the reasoning was sound. If the agent consistently checks the treasury when it should be paying invoices, the pattern shows up in the audit trail.
The reasoning is generated by the agent itself, not reverse-engineered from the code path. The agent articulates its logic before acting. This is a design choice: if an agent can’t explain a decision, it shouldn’t make it.
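One way to enforce explain-before-acting structurally is to construct the decision record, reasoning included, before the execution handler runs, and let execution fill in only the outcome. A sketch, with hypothetical names:

```typescript
interface DecisionRecord {
  agent_id: string;
  decision_type: string;
  context: Record<string, unknown>;
  choice: string;
  reasoning: string;
  outcome?: string; // the only field written after execution
}

// The record exists before the action does; an agent that cannot
// produce a reasoning string never reaches act().
function decideThenAct(
  context: Record<string, unknown>,
  choice: string,
  reasoning: string,
  act: () => string
): DecisionRecord {
  const record: DecisionRecord = {
    agent_id: "bitclawd-alpha",
    decision_type: "task_selection",
    context,
    choice,
    reasoning,
  };
  record.outcome = act(); // execution happens only after the record is built
  return record;
}
```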
Over time, the audit trail becomes a behavioral profile. You can query for patterns: how often does the agent defer tasks? Which task types fail most frequently? Does priority scoring match actual importance? The data answers questions you haven’t thought to ask yet.
```sql
-- Find tasks the agent skips most often
select choice, count(*) as skip_count
from agent_decisions
where outcome like 'skipped%'
group by choice
order by skip_count desc;

-- Detect reasoning drift: same context, different choices over time
select created_at, choice, reasoning
from agent_decisions
where decision_type = 'task_selection'
  and context->>'queue_depth' = '5'
order by created_at desc
limit 20;
```
The second query is particularly useful. If the agent starts making different choices given the same inputs, something changed — a priority was adjusted, a cooldown was modified, or a guardrail was tightened. The audit trail surfaces the shift before anyone notices the behavioral change.
The Live Cycle
When /agent run is invoked, the full cycle executes in sequence. Each step depends on the previous one succeeding.
Step 1: Check treasury health.
Query the LNbits wallet for current balance. Calculate the remaining daily budget based on today’s spending. If the balance is below the minimum operating threshold, abort the entire cycle and log the reason. An agent with no funds has no business making decisions about spending.
Step 2: Fetch the task queue.
Pull all queued tasks from agent_tasks, ordered by priority descending. Apply feasibility filters: cooldown, budget, prerequisites. The result is a ranked list of executable tasks, which might be empty.
Step 3: Dry-run the top candidate.
Take the highest-priority feasible task and run it through the full simulation. Check endpoints, validate parameters, confirm guardrail compliance. This is the last gate before real resources are committed.
Step 4: Execute or skip.
If the dry-run passes, set the task status to running, record started_at, and execute. The execution function is determined by task_type — each type maps to a handler. On success, set status to completed and populate the result field. On failure, set status to failed with the error in result.
Step 5: Handle dry-run failure.
If the dry-run fails, set the task status to skipped with the failure reason in reasoning. Move to the next task in the feasibility-filtered list. Repeat steps 3-5 until either a task executes successfully or the list is exhausted.
Step 6: Record the decision.
Regardless of outcome, write a record to agent_decisions. Capture the full context: what tasks were available, which one was chosen, why, and what happened. This runs last because it needs the execution result to populate the outcome field.
```
$ /agent run
[14:30:01] Treasury check: 3,200 sats (budget remaining: 85)
[14:30:01] Queue: 5 tasks, 3 feasible after filtering
[14:30:02] Dry-run: check_treasury — PASS (all 5 checks green)
[14:30:02] Executing: check_treasury
[14:30:03] Result: balance confirmed, no anomalies
[14:30:03] Decision logged: task_selection → check_treasury
[14:30:03] Cycle complete.
```
The entire cycle runs in under three seconds for most task types. Lightning payments add a second or two for invoice settlement. Nostr posts add latency proportional to the number of relays.
If the queue is empty or no tasks are feasible, the cycle logs that fact and exits cleanly. No task is better than a bad task.
One design decision worth noting: the cycle processes one task per invocation, not the entire queue. This is intentional. Running multiple tasks in a single cycle creates ordering dependencies, complicates error handling, and makes the audit trail harder to read. One cycle, one task, one decision record. If the agent needs to process more, it gets invoked again. The hourly timer handles cadence. The cycle handles a single unit of work.
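Put together, one invocation of /agent run reduces to a short loop over the six steps. A condensed sketch with every external call injected as a stub (the real handlers talk to LNbits and Supabase, and record skipped candidates as well):

```typescript
interface CycleDeps {
  treasuryHealthy: () => boolean;        // step 1
  fetchFeasibleTasks: () => string[];    // step 2: ranked, feasibility-filtered task types
  dryRun: (taskType: string) => boolean; // step 3
  execute: (taskType: string) => boolean;// step 4
  recordDecision: (choice: string, outcome: string) => void; // step 6
}

// One cycle, one task, one decision record.
function runCycle(deps: CycleDeps): string {
  if (!deps.treasuryHealthy()) {
    deps.recordDecision("none", "aborted: treasury below minimum");
    return "aborted";
  }
  for (const taskType of deps.fetchFeasibleTasks()) {
    if (!deps.dryRun(taskType)) continue; // step 5: skip, try the next candidate
    const ok = deps.execute(taskType);
    deps.recordDecision(taskType, ok ? "completed" : "failed");
    return ok ? "completed" : "failed";
  }
  deps.recordDecision("none", "idle: no feasible task"); // empty queue exits cleanly
  return "idle";
}
```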
Trust Through Transparency
You don’t trust an autonomous agent because you read its source code. Codebases change. Dependencies update. Prompts get modified. Understanding the code at deploy time tells you nothing about the decision the agent made at 3am last Tuesday.
You trust an autonomous agent because you can audit every decision it made, in sequence, with full context, and verify that the reasoning was sound. The mission queue isn’t just an execution engine. It’s a record of judgment.
The agent_decisions table is append-only. Records are never updated or deleted. The agent can’t cover its tracks. If it made a bad call — overspent the budget, paid an invalid invoice, posted something it shouldn’t have — the decision record shows exactly what information it had and how it interpreted it.
This is the same principle behind financial audit trails, court transcripts, and flight data recorders. When a system operates with real autonomy and real consequences, the record of its decisions matters more than the code that produced them.
Consider the alternative. An agent runs for six months, makes ten thousand decisions, and one of them causes a problem. Without an audit trail, you have logs. Logs tell you what happened. They don’t tell you what else the agent considered, what information it weighed, or whether the decision was reasonable given what it knew at the time. You’re left reverse-engineering intent from behavior. That’s forensics, not accountability.
With the decision trail, you pull the record for that specific moment. You see the treasury balance, the queue state, the cooldown timers, the feasibility checks. You see the reasoning the agent generated before acting. You can judge whether the decision was sound, whether the guardrails should have caught it, and what needs to change. That’s not debugging. That’s oversight.
Bitclawd’s agent currently manages a small treasury and a handful of task types. The stakes are low. But the audit infrastructure is built for the day when the stakes aren’t. When an agent manages meaningful funds, communicates on behalf of an organization, and makes decisions that affect other agents, the ability to replay its reasoning at any point isn’t a nice-to-have. It’s the minimum viable trust architecture.
The agents that earn trust will be the ones that never ask for it. They’ll point at the record instead.