When you wrap claude -p behind an HTTP API, you turn a local CLI tool into a multi-tenant agent runtime. The architecture is straightforward — queue jobs, spawn workers, parse JSON — but the economics demand attention. Each subprocess burns up to ~50K tokens on initialization before it touches your actual prompt, and at scale that overhead becomes your largest line item.
Architecture
A SaaS backend built on the CLI follows a single request lifecycle. The client never talks to Claude directly — your API layer handles auth, budgeting, and queueing, while workers manage the subprocess lifecycle.
```
Client -> HTTP API -> Queue -> Worker -> claude -p -> Parse JSON -> Return
                                |
                     Session Store (session_ids per user)
```

Here is the request lifecycle in code. The `runClaude` wrapper handles subprocess invocation and returns the parsed JSON envelope:
```javascript
import { execFileSync } from 'child_process';

function runClaude(prompt, { sessionId, budget = 0.50, schema, persist = false } = {}) {
  const args = [
    '-p', prompt,
    '--output-format', 'json',
    '--max-budget-usd', String(budget),
    '--permission-mode', 'bypassPermissions',
  ];
  if (sessionId) args.push('--resume', sessionId);
  else if (!persist) args.push('--no-session-persistence');
  if (schema) args.push('--json-schema', JSON.stringify(schema));

  const output = execFileSync('claude', args, {
    encoding: 'utf-8',
    timeout: 120_000,
    env: { ...process.env, CLAUDECODE: '' },
  });
  const data = JSON.parse(output);
  if (data.is_error) {
    throw new Error(`Claude error (${data.subtype}): ${data.result}`);
  }
  return data;
}
```

The API layer validates auth, checks the user's budget, enqueues the job, and a worker then calls `runClaude`. The worker parses the JSON envelope, extracts `result`, `total_cost_usd`, and `session_id`, returns the result to the client, and logs the cost to the user's running total for billing.
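That API layer might look like the following sketch. `authenticate`, `db`, `claudeQueue`, the route shape, and the $5 daily limit are all placeholder assumptions, not part of the CLI:

```javascript
// Per-user budget gate, checked before enqueueing. The $5 default is illustrative.
function underDailyLimit(spentTodayUsd, dailyLimitUsd = 5.00) {
  return spentTodayUsd < dailyLimitUsd;
}

// Express-style handler sketch. `authenticate`, `db`, and `claudeQueue`
// are hypothetical stand-ins for your auth, persistence, and queue layers.
async function handleRequest(req, res) {
  const user = await authenticate(req);
  if (!underDailyLimit(await db.getSpentToday(user.id))) {
    return res.status(429).json({ error: 'daily budget exhausted' });
  }
  const job = await claudeQueue.add({ userId: user.id, prompt: req.body.prompt });
  const result = await job.finished(); // the worker ran runClaude()
  await db.incrementCost(user.id, result.total_cost_usd);
  res.json({ result: result.result, session_id: result.session_id });
}
```

The budget check runs before enqueueing so a user over their limit never consumes a worker slot.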
Session Management
In a multi-tenant system, each user needs their own conversation state. Store session IDs per user in your database, and pass them to --resume on subsequent requests:
```javascript
async function getOrCreateSession(userId, prompt) {
  const session = await db.getActiveSession(userId);

  let result;
  if (session && session.age < MAX_SESSION_AGE) {
    result = runClaude(prompt, { sessionId: session.sessionId });
  } else {
    // Fresh session: persistence must stay enabled on this call, or the CLI
    // writes no session and result.session_id cannot be resumed later.
    result = runClaude(prompt, { persist: true });
    await db.storeSession(userId, result.session_id);
  }

  await db.incrementCost(userId, result.total_cost_usd);
  return result;
}
```

There are two mutually exclusive strategies here. With `--resume`, you get conversational continuity, but sessions accumulate on disk at 50-500 KB each. With `--no-session-persistence`, you get a stateless architecture with no disk writes but no way to continue a conversation. Choose one per product — they cannot be mixed for the same user flow.
Cap each session with a max age or turn count, then start a fresh session once the cap is hit. Without this, context grows unbounded and every subsequent turn costs more tokens.
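A minimal rotation check might look like this sketch; the 30-minute age and 20-turn thresholds are illustrative, not recommendations:

```javascript
const MAX_SESSION_AGE_MS = 30 * 60 * 1000; // illustrative: 30 minutes
const MAX_TURNS = 20;                      // illustrative: rotate after 20 turns

// Decide whether to start a fresh session instead of resuming.
// `session` is whatever your DB stored: { sessionId, createdAt, turns }.
function shouldRotate(session, now = Date.now()) {
  if (!session) return true;
  if (now - session.createdAt > MAX_SESSION_AGE_MS) return true;
  if (session.turns >= MAX_TURNS) return true;
  return false;
}
```

When `shouldRotate` returns true, call `runClaude` without a session ID and store the new `session_id` it returns.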
Cost Amortization
Each claude -p subprocess loads the system prompt, CLAUDE.md, and MCP tool descriptions before processing your actual prompt. The minimum overhead is ~16K tokens (no MCP servers); with MCP servers active, init can reach ~50K tokens as each tool description adds 200-500 tokens.
Init Overhead at Scale (Opus)
| Requests/day | Init cost | Note |
|---|---|---|
| 100 | $1.60/day | Acceptable for early products |
| 1,000 | $16/day | Init cost exceeds many SaaS margins |
| 10,000 | $160/day | Amortization is mandatory at this scale |
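The table's figures imply roughly $0.016 of init overhead per request ($1.60 / 100). A quick estimator makes the scaling explicit; the per-request figure is inferred from the table above, not an official price, so plug in your own measured value:

```javascript
// Inferred from the table above (~$1.60 / 100 requests on Opus).
// Measure your own init overhead and substitute it here.
const INIT_COST_PER_REQUEST = 0.016;

function dailyInitCost(requestsPerDay, perRequest = INIT_COST_PER_REQUEST) {
  return requestsPerDay * perRequest;
}
```

At 1,000 requests/day this reproduces the table's $16/day; every amortization strategy below attacks the `perRequest` factor.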
Three strategies reduce init cost:
Persistent stream-json mode pays init cost once and feeds prompts via stdin. Best for high-throughput services:
```bash
claude --output-format stream-json --input-format stream-json --verbose
# Feed prompts via stdin, read responses from stdout
```

Session resume pays init cost on the first request, then near-zero overhead on subsequent calls. Best for conversational products:
```bash
# First request — pays init cost
RESULT=$(claude -p "Hello" --output-format json --permission-mode bypassPermissions)
SESSION=$(echo "$RESULT" | jq -r '.session_id')

# Subsequent requests — near-zero init overhead
claude -p "Follow up" --resume "$SESSION" --output-format json \
  --permission-mode bypassPermissions
```

Stable system prompts exploit the cache. When every request uses the same `--system-prompt` string, cache hits give 90% savings on init tokens:
```bash
claude -p "$USER_PROMPT" \
  --system-prompt "You are a code review assistant for Acme Corp." \
  --output-format json --no-session-persistence \
  --permission-mode bypassPermissions
```

Scaling Patterns
Scaling Considerations
| Dimension | Constraint | Mitigation |
|---|---|---|
| Memory | ~200 MB per Claude process (10 workers = 2 GB, 50 workers = 10 GB) | Right-size worker pool to available RAM |
| Concurrency | Unbounded spawns will OOM the host | Use a job queue (Bull, BullMQ, Bee-Queue) with a concurrency limiter |
| Disk | Sessions accumulate at 50-500 KB each (10K/day = up to 5 GB/day) | Use --no-session-persistence or run cleanup crons |
| Rate limiting | No built-in per-user throttling | Implement in your API layer before enqueuing |
| Model routing | Opus is 5x the cost of Sonnet | Route simple tasks to Sonnet with --effort low, reserve Opus for complex reasoning |
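The per-user throttling row can be sketched as a fixed-window limiter in the API layer, checked before enqueueing. The window size and request cap below are illustrative, and an in-memory `Map` only works for a single API process — a multi-node deployment would need Redis or similar:

```javascript
// Fixed-window per-user rate limiter. Values are illustrative.
const WINDOW_MS = 60_000;
const MAX_REQUESTS_PER_WINDOW = 10;

const windows = new Map(); // userId -> { start, count }

function allowRequest(userId, now = Date.now()) {
  const w = windows.get(userId);
  if (!w || now - w.start >= WINDOW_MS) {
    windows.set(userId, { start: now, count: 1 }); // new window
    return true;
  }
  if (w.count >= MAX_REQUESTS_PER_WINDOW) return false; // throttled
  w.count += 1;
  return true;
}
```

Rejecting here keeps throttled traffic from ever reaching the queue or spawning a subprocess.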
A worker queue is the core scaling primitive. Use it to cap concurrent Claude processes and protect your host from memory exhaustion:
```javascript
import Queue from 'bull';

const claudeQueue = new Queue('claude-tasks', {
  limiter: { max: 10, duration: 60_000 },
});

claudeQueue.process(async (job) => {
  const { userId, prompt, sessionId } = job.data;
  return runClaude(prompt, { sessionId });
});
```

Layer cost control at multiple levels. Per-request caps via `--max-budget-usd` prevent runaway single calls. Per-user daily limits in your API layer prevent abuse. Model routing sends cheap tasks to Sonnet and expensive ones to Opus:
```javascript
const routing = {
  simple_qa:      { model: 'claude-sonnet-4-6', effort: 'low' },
  code_review:    { model: 'claude-sonnet-4-6', effort: 'medium' },
  security_audit: { model: 'claude-opus-4-6',   effort: 'high' },
};
```

Streaming to Clients
For real-time UIs, pipe --output-format stream-json through an SSE endpoint. The CLI emits NDJSON events, and your server translates text_delta events into SSE frames:
```javascript
import { spawn } from 'child_process';
import { createInterface } from 'readline';

app.post('/api/stream', (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  const proc = spawn('claude', [
    '-p', req.body.prompt,
    '--output-format', 'stream-json',
    '--verbose',
    '--include-partial-messages',
    '--permission-mode', 'bypassPermissions',
  ], { env: { ...process.env, CLAUDECODE: '' } });
  const rl = createInterface({ input: proc.stdout });

  rl.on('line', (line) => {
    if (!line.trim()) return;
    const event = JSON.parse(line);
    if (event.type === 'stream_event') {
      const delta = event.event?.delta;
      if (delta?.type === 'text_delta') {
        res.write(`data: ${JSON.stringify({ text: delta.text })}\n\n`);
      }
    } else if (event.type === 'result') {
      res.write(`data: ${JSON.stringify({ done: true, cost: event.total_cost_usd })}\n\n`);
      res.end();
    }
  });
});
```

The `--verbose` flag is required when using stream-json output. Without it, the CLI suppresses the stream events you need. The `--include-partial-messages` flag ensures you get text deltas as they arrive rather than waiting for complete messages.
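On the client side, a POST endpoint cannot be consumed with `EventSource` (which only issues GET requests), so a `fetch`-based reader is one option. This is a sketch assuming the `/api/stream` path and the `{ text }` / `{ done, cost }` frame shapes used by the server above:

```javascript
// Parse one SSE `data:` payload from the streaming endpoint.
function handleFrame(raw) {
  const payload = JSON.parse(raw);
  if (payload.done) return { type: 'done', cost: payload.cost };
  return { type: 'text', text: payload.text };
}

// Browser-side consumption sketch (fetch + stream reader).
async function streamPrompt(prompt, onText) {
  const res = await fetch('/api/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buf = '';
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buf += decoder.decode(value, { stream: true });
    const frames = buf.split('\n\n');
    buf = frames.pop(); // keep any incomplete frame in the buffer
    for (const frame of frames) {
      if (!frame.startsWith('data: ')) continue;
      const parsed = handleFrame(frame.slice(6));
      if (parsed.type === 'text') onText(parsed.text);
    }
  }
}
```

Buffering on the double newline matters: a network chunk can split an SSE frame mid-JSON, so only complete frames are parsed.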
Each new claude -p subprocess loads ~16K-50K tokens of system prompt, tool descriptions, and CLAUDE.md before processing your actual prompt. At 1,000 requests/day on Opus, that is $16/day in pure init overhead. Use --resume, persistent stream-json sessions, or stable --system-prompt strings to amortize this cost across requests.
A session-per-user architecture accumulates context with every turn. By turn 20, each request may carry thousands of tokens of prior conversation. Implement a max session age or turn count and start fresh sessions periodically — otherwise your per-request cost grows linearly with conversation length and eventually dominates your budget.
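A back-of-the-envelope model makes that growth concrete. The per-turn figure here is purely illustrative; actual growth depends on response lengths and tool output:

```javascript
// Illustrative linear-growth model: each turn appends roughly
// `perTurnTokens` of prior conversation to the next request's input.
function contextTokensAtTurn(turn, perTurnTokens = 500) {
  return turn * perTurnTokens;
}
```

At ~500 tokens per turn, a request at turn 20 carries ~10,000 tokens of prior conversation before the new prompt is even counted, which is why rotating sessions keeps per-request cost flat instead of linear.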