When you wrap claude -p behind an HTTP API, you turn a local CLI tool into a multi-tenant agent runtime. The architecture is straightforward — queue jobs, spawn workers, parse JSON — but the economics demand attention. Each subprocess burns up to ~50K tokens on initialization before it touches your actual prompt, and at scale that overhead becomes your largest line item.
Architecture
A SaaS backend built on the CLI follows a single request lifecycle. The client never talks to Claude directly — your API layer handles auth, budgeting, and queueing, while workers manage the subprocess lifecycle.
```
Client -> HTTP API -> Queue -> Worker -> claude -p -> Parse JSON -> Return
                                |
                     Session Store (session_ids per user)
```

Here is the request lifecycle in code. The `runClaude` wrapper handles subprocess invocation and returns the parsed JSON envelope:
```javascript
import { execFileSync } from 'child_process';

function runClaude(prompt, { sessionId, budget = 0.50, schema, persist = false } = {}) {
  const args = [
    '-p', prompt,
    '--output-format', 'json',
    '--max-budget-usd', String(budget),
    '--permission-mode', 'bypassPermissions',
  ];
  if (sessionId) args.push('--resume', sessionId);
  else if (!persist) args.push('--no-session-persistence');
  if (schema) args.push('--json-schema', JSON.stringify(schema));

  const output = execFileSync('claude', args, {
    encoding: 'utf-8',
    timeout: 120_000,
    env: { ...process.env, CLAUDECODE: '' },
  });
  const data = JSON.parse(output);
  if (data.is_error) {
    throw new Error(`Claude error (${data.subtype}): ${data.result}`);
  }
  return data;
}
```

The API layer validates auth, checks the user's budget, enqueues the job, and a worker then calls `runClaude`. The worker parses the JSON envelope, extracts `result`, `total_cost_usd`, and `session_id`, returns the result to the client, and logs the cost to the user's running total for billing.
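That API layer might look like the following sketch. `authenticate`, `db`, `claudeQueue`, the route shape, and the $5 daily limit are all placeholder assumptions, not part of the CLI:

```javascript
// Per-user budget gate, checked before enqueueing. The $5 default is illustrative.
function underDailyLimit(spentTodayUsd, dailyLimitUsd = 5.00) {
  return spentTodayUsd < dailyLimitUsd;
}

// Express-style handler sketch. `authenticate`, `db`, and `claudeQueue`
// are hypothetical stand-ins for your auth, persistence, and queue layers.
async function handleRequest(req, res) {
  const user = await authenticate(req);
  if (!underDailyLimit(await db.getSpentToday(user.id))) {
    return res.status(429).json({ error: 'daily budget exhausted' });
  }
  const job = await claudeQueue.add({ userId: user.id, prompt: req.body.prompt });
  const result = await job.finished(); // the worker ran runClaude()
  await db.incrementCost(user.id, result.total_cost_usd);
  res.json({ result: result.result, session_id: result.session_id });
}
```

The budget check runs before enqueueing so a user over their limit never consumes a worker slot.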
Session Management
In a multi-tenant system, each user needs their own conversation state. Store session IDs per user in your database, and pass them to --resume on subsequent requests:
```javascript
async function getOrCreateSession(userId, prompt) {
  const session = await db.getActiveSession(userId);

  let result;
  if (session && session.age < MAX_SESSION_AGE) {
    result = runClaude(prompt, { sessionId: session.sessionId });
  } else {
    // Fresh session: persistence must stay enabled on this call, or the CLI
    // writes no session and result.session_id cannot be resumed later.
    result = runClaude(prompt, { persist: true });
    await db.storeSession(userId, result.session_id);
  }

  await db.incrementCost(userId, result.total_cost_usd);
  return result;
}
```

There are two mutually exclusive strategies here. With `--resume`, you get conversational continuity, but sessions accumulate on disk at 50-500 KB each. With `--no-session-persistence`, you get a stateless architecture with no disk writes but no way to continue a conversation. Choose one per product — they cannot be mixed for the same user flow.
Cap each session with a max age or turn count, then start a fresh session once the cap is hit. Without this, context grows unbounded and every subsequent turn costs more tokens.
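A minimal rotation check might look like this sketch; the 30-minute age and 20-turn thresholds are illustrative, not recommendations:

```javascript
const MAX_SESSION_AGE_MS = 30 * 60 * 1000; // illustrative: 30 minutes
const MAX_TURNS = 20;                      // illustrative: rotate after 20 turns

// Decide whether to start a fresh session instead of resuming.
// `session` is whatever your DB stored: { sessionId, createdAt, turns }.
function shouldRotate(session, now = Date.now()) {
  if (!session) return true;
  if (now - session.createdAt > MAX_SESSION_AGE_MS) return true;
  if (session.turns >= MAX_TURNS) return true;
  return false;
}
```

When `shouldRotate` returns true, call `runClaude` without a session ID and store the new `session_id` it returns.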
Cost Amortization
Each claude -p subprocess loads the system prompt, CLAUDE.md, and MCP tool descriptions before processing your actual prompt. The minimum overhead is ~16K tokens (no MCP servers); with MCP servers active, init can reach ~50K tokens as each tool description adds 200-500 tokens.
Init Overhead at Scale (Opus)
| Requests/day | Init cost | Note |
|---|---|---|
| 100 | $1.60/day | Acceptable for early products |
| 1,000 | $16/day | Init cost exceeds many SaaS margins |
| 10,000 | $160/day | Amortization is mandatory at this scale |
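The table's figures imply roughly $0.016 of init overhead per request ($1.60 / 100). A quick estimator makes the scaling explicit; the per-request figure is inferred from the table above, not an official price, so plug in your own measured value:

```javascript
// Inferred from the table above (~$1.60 / 100 requests on Opus).
// Measure your own init overhead and substitute it here.
const INIT_COST_PER_REQUEST = 0.016;

function dailyInitCost(requestsPerDay, perRequest = INIT_COST_PER_REQUEST) {
  return requestsPerDay * perRequest;
}
```

At 1,000 requests/day this reproduces the table's $16/day; every amortization strategy below attacks the `perRequest` factor.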
Three strategies reduce init cost:
Persistent stream-json mode pays init cost once and feeds prompts via stdin. Best for high-throughput services:
```bash
claude --output-format stream-json --input-format stream-json --verbose
# Feed prompts via stdin, read responses from stdout
```

Session resume pays init cost on the first request, then near-zero overhead on subsequent calls. Best for conversational products:
```bash
# First request — pays init cost
RESULT=$(claude -p "Hello" --output-format json --permission-mode bypassPermissions)
SESSION=$(echo "$RESULT" | jq -r '.session_id')

# Subsequent requests — near-zero init overhead
claude -p "Follow up" --resume "$SESSION" --output-format json \
  --permission-mode bypassPermissions
```

Stable system prompts exploit the cache. When every request uses the same `--system-prompt` string, cache hits give 90% savings on init tokens:
```bash
claude -p "$USER_PROMPT" \
  --system-prompt "You are a code review assistant for Acme Corp." \
  --output-format json --no-session-persistence \
  --permission-mode bypassPermissions
```

Scaling Patterns
Scaling Considerations
| Dimension | Constraint | Mitigation |
|---|---|---|
| Memory | ~200 MB per Claude process (10 workers = 2 GB, 50 workers = 10 GB) | Right-size worker pool to available RAM |
| Concurrency | Unbounded spawns will OOM the host | Use a job queue (Bull, BullMQ, Bee-Queue) with a concurrency limiter |
| Disk | Sessions accumulate at 50-500 KB each (10K/day = up to 5 GB/day) | Use --no-session-persistence or run cleanup crons |
| Rate limiting | No built-in per-user throttling | Implement in your API layer before enqueuing |
| Model routing | Opus is 5x the cost of Sonnet | Route simple tasks to Sonnet with --effort low, reserve Opus for complex reasoning |
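The per-user throttling row can be sketched as a fixed-window limiter in the API layer, checked before enqueueing. The window size and request cap below are illustrative, and an in-memory `Map` only works for a single API process — a multi-node deployment would need Redis or similar:

```javascript
// Fixed-window per-user rate limiter. Values are illustrative.
const WINDOW_MS = 60_000;
const MAX_REQUESTS_PER_WINDOW = 10;

const windows = new Map(); // userId -> { start, count }

function allowRequest(userId, now = Date.now()) {
  const w = windows.get(userId);
  if (!w || now - w.start >= WINDOW_MS) {
    windows.set(userId, { start: now, count: 1 }); // new window
    return true;
  }
  if (w.count >= MAX_REQUESTS_PER_WINDOW) return false; // throttled
  w.count += 1;
  return true;
}
```

Rejecting here keeps throttled traffic from ever reaching the queue or spawning a subprocess.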
A worker queue is the core scaling primitive. Use it to cap concurrent Claude processes and protect your host from memory exhaustion:
```javascript
import Queue from 'bull';

const claudeQueue = new Queue('claude-tasks', {
  limiter: { max: 10, duration: 60_000 },
});

claudeQueue.process(async (job) => {
  const { userId, prompt, sessionId } = job.data;
  return runClaude(prompt, { sessionId });
});
```

Layer cost control at multiple levels. Per-request caps via `--max-budget-usd` prevent runaway single calls. Per-user daily limits in your API layer prevent abuse. Model routing sends cheap tasks to Sonnet and expensive ones to Opus:
```javascript
const routing = {
  simple_qa:      { model: 'claude-sonnet-4-6', effort: 'low' },
  code_review:    { model: 'claude-sonnet-4-6', effort: 'medium' },
  security_audit: { model: 'claude-opus-4-6',   effort: 'high' },
};
```

Streaming to Clients
For real-time UIs, pipe --output-format stream-json through an SSE endpoint. The CLI emits NDJSON events, and your server translates text_delta events into SSE frames:
```javascript
import { spawn } from 'child_process';
import { createInterface } from 'readline';

app.post('/api/stream', (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  const proc = spawn('claude', [
    '-p', req.body.prompt,
    '--output-format', 'stream-json',
    '--verbose',
    '--include-partial-messages',
    '--permission-mode', 'bypassPermissions',
  ], { env: { ...process.env, CLAUDECODE: '' } });
  const rl = createInterface({ input: proc.stdout });

  rl.on('line', (line) => {
    if (!line.trim()) return;
    const event = JSON.parse(line);
    if (event.type === 'stream_event') {
      const delta = event.event?.delta;
      if (delta?.type === 'text_delta') {
        res.write(`data: ${JSON.stringify({ text: delta.text })}\n\n`);
      }
    } else if (event.type === 'result') {
      res.write(`data: ${JSON.stringify({ done: true, cost: event.total_cost_usd })}\n\n`);
      res.end();
    }
  });
});
```

The `--verbose` flag is required when using stream-json output. Without it, the CLI suppresses the stream events you need. The `--include-partial-messages` flag ensures you get text deltas as they arrive rather than waiting for complete messages.
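On the client side, a POST endpoint cannot be consumed with `EventSource` (which only issues GET requests), so a `fetch`-based reader is one option. This is a sketch assuming the `/api/stream` path and the `{ text }` / `{ done, cost }` frame shapes used by the server above:

```javascript
// Parse one SSE `data:` payload from the streaming endpoint.
function handleFrame(raw) {
  const payload = JSON.parse(raw);
  if (payload.done) return { type: 'done', cost: payload.cost };
  return { type: 'text', text: payload.text };
}

// Browser-side consumption sketch (fetch + stream reader).
async function streamPrompt(prompt, onText) {
  const res = await fetch('/api/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buf = '';
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buf += decoder.decode(value, { stream: true });
    const frames = buf.split('\n\n');
    buf = frames.pop(); // keep any incomplete frame in the buffer
    for (const frame of frames) {
      if (!frame.startsWith('data: ')) continue;
      const parsed = handleFrame(frame.slice(6));
      if (parsed.type === 'text') onText(parsed.text);
    }
  }
}
```

Buffering on the double newline matters: a network chunk can split an SSE frame mid-JSON, so only complete frames are parsed.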
Each new claude -p subprocess loads ~16K-50K tokens of system prompt, tool descriptions, and CLAUDE.md before processing your actual prompt. At 1,000 requests/day on Opus, that is $16/day in pure init overhead. Use --resume, persistent stream-json sessions, or stable --system-prompt strings to amortize this cost across requests.
A session-per-user architecture accumulates context with every turn. By turn 20, each request may carry thousands of tokens of prior conversation. Implement a max session age or turn count and start fresh sessions periodically — otherwise your per-request cost grows linearly with conversation length and eventually dominates your budget.
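A back-of-the-envelope model makes that growth concrete. The per-turn figure here is purely illustrative; actual growth depends on response lengths and tool output:

```javascript
// Illustrative linear-growth model: each turn appends roughly
// `perTurnTokens` of prior conversation to the next request's input.
function contextTokensAtTurn(turn, perTurnTokens = 500) {
  return turn * perTurnTokens;
}
```

At ~500 tokens per turn, a request at turn 20 carries ~10,000 tokens of prior conversation before the new prompt is even counted, which is why rotating sessions keeps per-request cost flat instead of linear.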