SaaS Patterns

Multi-tenant architecture, queue-based processing, scaling math

When you wrap claude -p behind an HTTP API, you turn a local CLI tool into a multi-tenant agent runtime. The architecture is straightforward — queue jobs, spawn workers, parse JSON — but the economics demand attention. Each subprocess burns ~16K-50K tokens on initialization before it touches your actual prompt, and at scale that overhead becomes your largest line item.

Architecture

A SaaS backend built on the CLI follows a single request lifecycle. The client never talks to Claude directly — your API layer handles auth, budgeting, and queueing, while workers manage the subprocess lifecycle.

Client -> HTTP API -> Queue -> Worker -> claude -p -> Parse JSON -> Return
                                 |
                           Session Store
                      (session_ids per user)

Here is the request lifecycle in code. The runClaude wrapper handles subprocess invocation and returns the parsed JSON envelope:

import { execFileSync } from 'child_process';

function runClaude(prompt, { sessionId, budget = 0.50, schema } = {}) {
  const args = ['-p', prompt, '--output-format', 'json',
    '--max-budget-usd', String(budget),
    '--permission-mode', 'bypassPermissions'];
  if (sessionId) args.push('--resume', sessionId);
  else args.push('--no-session-persistence');
  if (schema) args.push('--json-schema', JSON.stringify(schema));

  const output = execFileSync('claude', args, {
    encoding: 'utf-8', timeout: 120_000,
    // Clear CLAUDECODE so the CLI doesn't treat this as a nested session
    env: { ...process.env, CLAUDECODE: '' },
  });

  const data = JSON.parse(output);
  if (data.is_error) throw new Error(`Claude error (${data.subtype}): ${data.result}`);
  return data;
}

The API layer validates auth, checks the user’s budget, enqueues the job, and then a worker calls runClaude. The worker parses the JSON envelope, extracts result, total_cost_usd, and session_id, returns the result to the client, and logs cost to the user’s running total for billing.
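The worker's parse-and-bill step can be sketched as a small pure helper. The field names follow the JSON envelope shown above; the helper name and the null default for a missing session ID are illustrative.

```typescript
// Hypothetical helper for the worker's parse step: pull out the fields
// the API layer bills and returns. Field names follow the CLI's JSON
// envelope (result, total_cost_usd, session_id, is_error, subtype).
interface Envelope {
  result: string;
  total_cost_usd: number;
  session_id?: string;
  is_error?: boolean;
  subtype?: string;
}

function extractBillableResult(data: Envelope) {
  if (data.is_error) {
    throw new Error(`Claude error (${data.subtype}): ${data.result}`);
  }
  return {
    result: data.result,                // returned to the client
    cost: data.total_cost_usd,          // logged to the user's running total
    sessionId: data.session_id ?? null, // absent with --no-session-persistence
  };
}
```

Keeping this logic out of the queue handler makes the billing path easy to unit-test without spawning a subprocess.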

Session Management

In a multi-tenant system, each user needs their own conversation state. Store session IDs per user in your database, and pass them to --resume on subsequent requests:

async function getOrCreateSession(userId, prompt) {
  const session = await db.getActiveSession(userId);
  let result;
  if (session && session.age < MAX_SESSION_AGE) {
    result = runClaude(prompt, { sessionId: session.sessionId });
  } else {
    result = runClaude(prompt);
    await db.storeSession(userId, result.session_id);
  }
  await db.incrementCost(userId, result.total_cost_usd);
  return result;
}

There are two mutually exclusive strategies here. With --resume, you get conversational continuity but sessions accumulate on disk at 50-500 KB each. With --no-session-persistence, you get a stateless architecture with no disk writes but no way to continue a conversation. Choose one per product — they cannot be mixed for the same user flow.

Implement a max session age or turn count, then start a fresh session. Without this, context grows unbounded and every subsequent turn costs more tokens.
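The rotation check can be sketched as a pure predicate. The one-hour age cap, the 20-turn limit, and the stored-session shape are illustrative assumptions; tune them to your product.

```typescript
// Sketch of session rotation by age and turn count. The limits and the
// StoredSession shape are assumptions, not part of the CLI.
const MAX_SESSION_AGE_MS = 60 * 60 * 1000; // 1 hour
const MAX_TURNS = 20;

interface StoredSession {
  sessionId: string;
  createdAt: number; // epoch ms
  turns: number;
}

// Returns true when the caller should start a fresh session instead of
// passing --resume, bounding per-request context growth.
function shouldRotate(session: StoredSession | null, now = Date.now()): boolean {
  if (!session) return true;
  if (now - session.createdAt > MAX_SESSION_AGE_MS) return true;
  if (session.turns >= MAX_TURNS) return true;
  return false;
}
```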

Cost Amortization

Each claude -p subprocess loads the system prompt, CLAUDE.md, and MCP tool descriptions before processing your actual prompt. The minimum overhead is ~16K tokens (no MCP servers); with MCP servers active, init can reach ~50K tokens as each tool description adds 200-500 tokens.

Init Overhead at Scale (Opus)

| Requests/day | Init cost | Note |
|---|---|---|
| 100 | $1.60/day | Acceptable for early products |
| 1,000 | $16/day | Init cost exceeds many SaaS margins |
| 10,000 | $160/day | Amortization is mandatory at this scale |

Three strategies reduce init cost:

Persistent stream-json mode pays init cost once and feeds prompts via stdin. Best for high-throughput services:

claude --output-format stream-json --input-format stream-json --verbose
# Feed prompts via stdin, read responses from stdout
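A persistent worker around that command can be sketched as follows. The NDJSON message shape written to stdin is an assumption based on the stream-json input format; verify it against your CLI version before relying on it.

```typescript
import { spawn } from 'child_process';
import { createInterface } from 'readline';

// Build one NDJSON input line per prompt. The exact message shape is an
// assumption about the stream-json input format.
function toStreamJsonLine(prompt: string): string {
  return JSON.stringify({
    type: 'user',
    message: { role: 'user', content: [{ type: 'text', text: prompt }] },
  }) + '\n';
}

// Long-lived worker sketch: init cost is paid once at spawn, then each
// prompt is written to stdin and results are read off stdout.
function startPersistentWorker(onResult: (event: any) => void) {
  const proc = spawn('claude', [
    '--output-format', 'stream-json',
    '--input-format', 'stream-json',
    '--verbose',
    '--permission-mode', 'bypassPermissions',
  ], { env: { ...process.env, CLAUDECODE: '' } });

  const rl = createInterface({ input: proc.stdout });
  rl.on('line', (line) => {
    if (!line.trim()) return;
    const event = JSON.parse(line);
    if (event.type === 'result') onResult(event); // hand back to the job in flight
  });

  return { send: (prompt: string) => proc.stdin.write(toStreamJsonLine(prompt)) };
}
```

One process per worker slot keeps the amortization math simple: a worker that handles 1,000 prompts pays init once instead of 1,000 times.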

Session resume pays init cost on the first request, then near-zero overhead on subsequent calls. Best for conversational products:

# First request — pays init cost
RESULT=$(claude -p "Hello" --output-format json --permission-mode bypassPermissions)
SESSION=$(echo "$RESULT" | jq -r '.session_id')
# Subsequent requests — near zero init overhead
claude -p "Follow up" --resume "$SESSION" --output-format json \
  --permission-mode bypassPermissions

Stable system prompts exploit the cache. When every request uses the same --system-prompt string, cache hits give 90% savings on init tokens:

claude -p "$USER_PROMPT" \
  --system-prompt "You are a code review assistant for Acme Corp." \
  --output-format json --no-session-persistence \
  --permission-mode bypassPermissions

Scaling Patterns

Scaling Considerations

| Dimension | Constraint | Mitigation |
|---|---|---|
| Memory | ~200 MB per Claude process (10 workers = 2 GB, 50 workers = 10 GB) | Right-size worker pool to available RAM |
| Concurrency | Unbounded spawns will OOM the host | Use a job queue (Bull, BullMQ, Bee-Queue) with a concurrency limiter |
| Disk | Sessions accumulate at 50-500 KB each (10K/day = up to 5 GB/day) | Use --no-session-persistence or run cleanup crons |
| Rate limiting | No built-in per-user throttling | Implement in your API layer before enqueuing |
| Model routing | Opus is 5x the cost of Sonnet | Route simple tasks to Sonnet with --effort low, reserve Opus for complex reasoning |
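The per-user throttle from the table can be sketched as a fixed-window counter. This in-memory version is illustrative only; a real deployment would back it with Redis so limits hold across API pods.

```typescript
// Minimal fixed-window per-user throttle, checked before enqueuing a job.
// The window size and limit are illustrative.
const WINDOW_MS = 60_000;
const MAX_REQUESTS_PER_WINDOW = 5;

const windows = new Map<string, { start: number; count: number }>();

function allowRequest(userId: string, now = Date.now()): boolean {
  const w = windows.get(userId);
  if (!w || now - w.start >= WINDOW_MS) {
    // New window: reset the counter for this user
    windows.set(userId, { start: now, count: 1 });
    return true;
  }
  if (w.count >= MAX_REQUESTS_PER_WINDOW) return false; // reject, do not enqueue
  w.count += 1;
  return true;
}
```

Rejecting before the queue matters: a queued job will eventually spawn a subprocess and spend tokens, so abuse control after enqueueing is abuse control after the money is spent.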

A worker queue is the core scaling primitive. Use it to cap concurrent Claude processes and protect your host from memory exhaustion:

import Queue from 'bull';

const claudeQueue = new Queue('claude-tasks', {
  limiter: { max: 10, duration: 60_000 }, // rate limit: at most 10 job starts/minute
});

// The concurrency argument caps simultaneous Claude processes (and thus
// memory); the limiter above additionally rate-limits job starts.
claudeQueue.process(10, async (job) => {
  const { userId, prompt, sessionId } = job.data;
  return runClaude(prompt, { sessionId });
});

Layer cost control at multiple levels. Per-request caps via --max-budget-usd prevent runaway single calls. Per-user daily limits in your API layer prevent abuse. Model routing sends cheap tasks to Sonnet and expensive ones to Opus:

const routing = {
  simple_qa: { model: 'claude-sonnet-4-6', effort: 'low' },
  code_review: { model: 'claude-sonnet-4-6', effort: 'medium' },
  security_audit: { model: 'claude-opus-4-6', effort: 'high' },
};
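A dispatch sketch over that table (repeated here so the snippet compiles standalone). Falling back to the cheapest route for unknown task types is a deliberate cost-safety default, and an assumption of this sketch rather than anything the CLI mandates.

```typescript
// Route lookup with a cheap fallback. The worker would pass route.model
// and route.effort through to the CLI, assuming runClaude is extended
// to forward those flags.
const routing = {
  simple_qa: { model: 'claude-sonnet-4-6', effort: 'low' },
  code_review: { model: 'claude-sonnet-4-6', effort: 'medium' },
  security_audit: { model: 'claude-opus-4-6', effort: 'high' },
} as const;

type TaskType = keyof typeof routing;

function routeTask(task: string): (typeof routing)[TaskType] {
  return routing[task as TaskType] ?? routing.simple_qa;
}
```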

Streaming to Clients

For real-time UIs, pipe --output-format stream-json through an SSE endpoint. The CLI emits NDJSON events, and your server translates text_delta events into SSE frames:

import { spawn } from 'child_process';
import { createInterface } from 'readline';

app.post('/api/stream', (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');

  const proc = spawn('claude', ['-p', req.body.prompt,
    '--output-format', 'stream-json', '--verbose',
    '--include-partial-messages',
    '--permission-mode', 'bypassPermissions'],
    { env: { ...process.env, CLAUDECODE: '' } });

  const rl = createInterface({ input: proc.stdout });
  rl.on('line', (line) => {
    if (!line.trim()) return;
    const event = JSON.parse(line);
    if (event.type === 'stream_event') {
      // Partial text arrives as text_delta events; forward each as an SSE frame
      const delta = event.event?.delta;
      if (delta?.type === 'text_delta') {
        res.write(`data: ${JSON.stringify({ text: delta.text })}\n\n`);
      }
    } else if (event.type === 'result') {
      // Final envelope: report cost and close the SSE stream
      res.write(`data: ${JSON.stringify({ done: true, cost: event.total_cost_usd })}\n\n`);
      res.end();
    }
  });
});

The --verbose flag is required when using stream-json output. Without it, the CLI suppresses the stream events you need. The --include-partial-messages flag ensures you get text deltas as they arrive rather than waiting for complete messages.

Gotcha

Each new claude -p subprocess loads ~16K-50K tokens of system prompt, tool descriptions, and CLAUDE.md before processing your actual prompt. At 1,000 requests/day on Opus, that is $16/day in pure init overhead. Use --resume, persistent stream-json sessions, or stable --system-prompt strings to amortize this cost across requests.

Gotcha

A session-per-user architecture accumulates context with every turn. By turn 20, each request may carry thousands of tokens of prior conversation. Implement a max session age or turn count and start fresh sessions periodically — otherwise your per-request cost grows linearly with conversation length and eventually dominates your budget.