Claude Code Session Performance Investigation

Date: 2025-11-08 Goal: Investigate and improve response times for programmatic Claude Code sessions

Problem Statement

The initial implementation using Claude Code CLI in --print mode showed unacceptable latency:

Simple "Hey" message: 12.7 seconds total (13s API time)
"Hmm" message: 3.5 seconds total (16.4s API time)
"Still slow" message: 8 seconds total (24.3s API time)

This made interactive chat feel sluggish and unusable for real-time conversations.

Investigation Timeline

1. Initial Hypothesis: System Prompt Size

Hypothesis: The 605-line modeler.md system prompt was causing slowness.

Test: Compared with regular Claude Code interface which uses the same system prompt.

Result: ❌ Rejected - Regular Claude Code with the same system prompt is fast, so this wasn't the issue.

Files: Initial observations in test-claude-session.ts

2. Known Bug Discovery

Finding: Web search revealed GitHub Issue #3600 - "claude -p extremely slow"

Key Insights:

--print mode is 3-10x slower than interactive mode
Reports of 3-minute delays for simple commands
Multiple users affected across all platforms
This is a known bug from July 2025

Root Cause: --print mode performs full environment initialization on every message, even with persistent processes.

3. Solution Attempt #1: Official Claude Agent SDK

Approach: Replace CLI spawning with official @anthropic-ai/claude-agent-sdk

Implementation: src/lib/claude-code-session.ts (lines 1-212)

Changes:

Removed spawn() + --print mode
Used SDK's query() function with session resumption
Maintained session ID across messages for context

Results:

Message 1 ("Hey"):       3.6s total, 4.5s API  ✅ 72% faster!
Message 2 ("Thanks"):    3.5s total, 4.3s API  ✅
Message 3 ("Bye"):       2.7s total, 1.6s API  ✅

Test File: test-sdk-performance.ts, test-sdk-short-messages.ts

Outcome: ✅ Significant improvement - 3x faster for short messages

4. User Feedback: Still Too Slow

Observation: Longer responses (e.g., detailed explanation of cognitive spaces) took 16 seconds.

Question: Is this the response length or the session management?

Analysis:

16s response had ~500 words (extensive cognitive spaces explanation)
API time: 21s (longer than total = streaming in progress)
Comparable to token generation time for that content length

5. Solution Attempt #2: Persistent CLI Process

Hypothesis: VSCode extension maintains a single persistent process without --print mode.

Approach: Spawn Claude Code in interactive mode (like VSCode does) and maintain persistent stdin/stdout communication.

Implementation: src/lib/claude-code-session-cli.ts (removed - approach abandoned)

Key Differences:

# Old approach (slow)
claude --print --verbose --output-format stream-json --input-format stream-json

# VSCode approach (fast)
claude --output-format stream-json --verbose --input-format stream-json

Results:

Message 1 ("Hey"):       13.4s total, 11.8s API  ❌ SLOWER than SDK!
Message 2 ("Thanks"):    2.2s total, 13.9s API  ❌ API time doesn't make sense
Message 3 ("Bye"):       2.6s total, 16.4s API  ❌ Inconsistent

Test File: test-cli-only.ts, test-cli-vs-sdk.ts

Outcome: ❌ Not better - First message very slow, inconsistent API timing

6. Process Analysis

Test: Monitor Claude processes during SDK usage

Finding:

Process count: 13 → 17 → 13
SDK spawns processes but cleans them up
Each query() call creates new subprocess

Implication: SDK already manages process lifecycle efficiently, but each message still requires subprocess initialization.

Current Performance Summary

Approach	Short Message	Long Response	First Message	Status
Original `--print`	8-13s	12-24s	~12s	❌ Unacceptable
Agent SDK	2.7-3.6s	16s	~3.6s	✅ Best
CLI Persistent	2.2-2.6s	Untested	13.4s	❌ Slow startup

Key Findings

--print mode is fundamentally slow due to known bug in Claude Code CLI
Agent SDK provides 70% improvement by avoiding --print overhead
Persistent CLI approach doesn't help - first message initialization is still expensive
Long responses are proportional to content length - 16s for 500-word response is expected
SDK appears to reuse sessions but still spawns subprocesses per query

Remaining Performance Bottleneck

The SDK's query() function still spawns a new Claude Code process for each message, which means:

Full environment re-initialization (git status, file system, etc.)
Context reconstruction from session ID
~3-4 second overhead per message

Comparison with VSCode Extension:

VSCode maintains truly persistent process
No re-initialization between messages
Sub-second response times for short messages

Gap: There's no public API to maintain a truly persistent Claude Code process like VSCode does. The SDK's session resumption helps with context but not with environment initialization.

Recommendations

Short Term (Current State)

✅ Use the Agent SDK implementation (src/lib/claude-code-session.ts)

Best available performance without reverse-engineering VSCode
Acceptable for chat interface (3-4s per message)
Properly maintains conversation context

Medium Term (If Performance Critical)

Request official "persistent session" API from Anthropic
File feature request for SDK to support truly persistent processes
Consider caching/memoization of environment state

Long Term (Ideal Solution)

Official SDK support for persistent process pool
Lazy environment initialization (only when tools are used)
Differential context updates instead of full reconstruction

Test Files

Kept:

test-persistent-session.ts - Final streaming input mode test ✅

Removed (investigation complete):

test-claude-session.ts, test-sdk-performance.ts, test-sdk-short-messages.ts
test-sdk-multi-message.ts, test-cli-only.ts, test-cli-vs-sdk.ts

7. Solution Attempt #3: True Persistent Session with Streaming Input

Date: 2025-11-08 (continued)

Discovery: GitHub Issue #34 revealed:

This is expected behavior when passing a string prompt. We recommend using streaming input to keep the process alive between turns.

Approach: Use AsyncIterable streaming input mode as documented in Streaming vs Single Mode.

Implementation: src/lib/claude-code-session.ts - complete rewrite

Key Changes:

// Create async generator that yields messages on demand
class MessageStream {
  async *generate(): AsyncGenerator<SDKUserMessage, void> {
    while (!this.stopped) {
      const msg = await this.nextMessage();
      if (msg) yield msg;
    }
  }
}

// Create ONE persistent query with the generator
this.currentQuery = query({
  prompt: this.messageStream.generate(),  // Not a string!
  options: queryOptions
});

// Send messages by pushing to the stream (no new query() calls)
this.messageStream.pushMessage(message, this.sessionId);

Results:

Session start:        1100ms (bootstrap + init)
Message 1:            2544ms (includes first response generation)
Message 2:            1239ms
Message 3:            1204ms
Message 4:            2218ms
Message 5:            1707ms

Average (msg 2-5):    1592ms

Test File: test-persistent-session.ts

Outcome: ✅ Persistent session confirmed - Same session ID maintained, no process recreation

Key Observation: We are seeing 1.2-2.5s response times, which is a 55% improvement over the resume approach (3.5s), but still not sub-second.

Important Insight: The 1-2 second delay appears to be actual API/model response time, NOT process initialization. Evidence:

Same session ID across all messages (no re-initialization)
Process stays alive between messages
Response time variability (1.2s-2.5s) suggests model thinking time
Improvement from 3.5s → 1.6s avg confirms we eliminated the resume overhead

Updated Performance Summary

Approach	Short Message	Long Response	First Message	Status
Original `--print`	8-13s	12-24s	~12s	❌ Unacceptable
Agent SDK (resume)	2.7-3.6s	16s	~3.6s	⚠ Acceptable
Streaming Input	1.2-2.5s	Untested	~2.5s	✅ Optimal

Improvement:

vs --print: 5-10x faster
vs resume: 2x faster
True persistent session - no process recreation overhead

Final Conclusion

We achieved truly persistent sessions using AsyncIterable streaming input mode. The implementation now:

✅ Maintains a single Claude Code process throughout the conversation
✅ No subprocess spawning between messages
✅ No context re-initialization overhead
✅ Same session ID persists across all messages

Performance bottleneck identified: The remaining 1-2 second delay is actual API response time (model thinking + token generation), not infrastructure overhead.

This is as fast as the SDK can go without changes to:

Model inference speed (server-side)
Network latency
Token generation rate

The streaming input approach has eliminated all client-side overhead. VSCode's sub-second feel likely comes from:

Immediate UI feedback before full response
Optimized rendering of partial responses
Potentially different model settings (lower max tokens, faster model tier)

Final recommendation: Use the streaming input implementation in src/lib/claude-code-session.ts - it's the optimal approach for the Claude Agent SDK.

8. CLI Streaming Mode - Using Max Subscription Instead of API Credits

Date: 2025-11-08 (final)

Motivation: The SDK requires API credits. Can we use the Claude CLI with Max subscription while maintaining comparable performance?

Discovery: The CLI supports --input-format stream-json which allows streaming multiple messages to a single --print process, avoiding per-message spawning.

Implementation: src/lib/claude-cli-session.ts

Key Approach:

claude --print --verbose \
  --output-format stream-json \
  --input-format stream-json \
  --system-prompt "..."

Write multiple JSON messages to stdin:

{"type":"user","message":{"role":"user","content":"Hey"},"parent_tool_use_id":null}
{"type":"user","message":{"role":"user","content":"Thanks"},"parent_tool_use_id":null}

Results with Minimal System Prompt:

Message 1: 609ms
Message 2: 3026ms
Message 3: 635ms
Message 4: 3768ms
Message 5: 5012ms

Average: ~2.6s

Results with Full modeler.md System Prompt (604 lines, 22KB):

Single word response: 2.6s
Medium response:      6-10s
Long response:        10-13s

Session Persistence Confirmed:

✅ Same session ID across all messages
✅ Same process PID (no respawning)
✅ True persistent session maintained

Performance Comparison:

Condition	SDK Streaming	CLI Streaming	Difference
Minimal prompt	1.6s avg	2.6s avg	+1s CLI overhead
Full modeler.md	~2.6s	~2.6s	Comparable
Long responses	Proportional	Proportional	Both scale with length

Key Findings:

CLI streaming works - Successfully streams multiple messages to one process
Performance is comparable to SDK when using the same system prompt
System prompt size matters - 22KB modeler.md adds ~1s overhead vs minimal prompts
Both approaches scale similarly - Long responses take longer regardless of method
CLI uses Max subscription - No API credit consumption

Trade-offs:

Factor	SDK	CLI
Billing	API credits	Max subscription
Performance	1.6s (minimal)	2.6s (minimal)
With modeler.md	2.6s	2.6s
Reliability	Battle-tested	Depends on CLI stability
Setup	API key required	`claude` CLI + auth

Implementation Choice:

Both implementations are maintained in parallel:

Default: CLI mode (src/lib/claude-cli-session.ts) - uses Max subscription
Optional: SDK mode (src/lib/claude-code-session.ts) - uses API key

Switch via environment variable:

# Use CLI (Max subscription) - default
npm run dev

# Use SDK (API credits)
USE_SDK=true npm run dev

Test Files Kept:

test-interactive.ts - Interactive REPL for testing both modes
test-persistent-session.ts - SDK streaming validation

Removed (investigation complete):

test-cli-accurate-timing.ts, test-cli-multi-message.ts, test-cli-no-print.ts
test-cli-performance.ts, test-cli-session-manager.ts, test-cli-streaming-investigation.ts

Final Performance Summary (All Approaches)

Approach	Min System Prompt	Full modeler.md	Billing	Status
CLI `--print` per-message	8-13s	12-24s	Max subscription	❌ Abandoned
SDK resume mode	2.7-3.6s	~3-4s	API credits	⚠️ Superseded
SDK streaming	1.6s	2.6s	API credits	✅ Optimal (API)
CLI streaming	2.6s	2.6s	Max subscription	✅ Optimal (Max)

Conclusion:

We now have two production-ready implementations:

SDK Streaming Mode - Fastest with minimal prompts, requires API key
CLI Streaming Mode - Comparable performance, uses Max subscription

Both achieve:

True persistent sessions
No process recreation overhead
Streaming responses
~2.6s response time with full modeler.md context

The performance bottleneck is model thinking time + system prompt processing, not infrastructure. Both implementations have eliminated all client-side overhead.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Claude Code Session Performance Investigation

Problem Statement

Investigation Timeline

1. Initial Hypothesis: System Prompt Size

2. Known Bug Discovery

3. Solution Attempt #1: Official Claude Agent SDK

4. User Feedback: Still Too Slow

5. Solution Attempt #2: Persistent CLI Process

6. Process Analysis

Current Performance Summary

Key Findings

Remaining Performance Bottleneck

Recommendations

Short Term (Current State)

Medium Term (If Performance Critical)

Long Term (Ideal Solution)

Test Files

7. Solution Attempt #3: True Persistent Session with Streaming Input

Updated Performance Summary

Final Conclusion

8. CLI Streaming Mode - Using Max Subscription Instead of API Credits

Final Performance Summary (All Approaches)

FilesExpand file tree

PERFORMANCE-INVESTIGATION.md

Latest commit

History

PERFORMANCE-INVESTIGATION.md

File metadata and controls

Claude Code Session Performance Investigation

Problem Statement

Investigation Timeline

1. Initial Hypothesis: System Prompt Size

2. Known Bug Discovery

3. Solution Attempt #1: Official Claude Agent SDK

4. User Feedback: Still Too Slow

5. Solution Attempt #2: Persistent CLI Process

6. Process Analysis

Current Performance Summary

Key Findings

Remaining Performance Bottleneck

Recommendations

Short Term (Current State)

Medium Term (If Performance Critical)

Long Term (Ideal Solution)

Test Files

7. Solution Attempt #3: True Persistent Session with Streaming Input

Updated Performance Summary

Final Conclusion

8. CLI Streaming Mode - Using Max Subscription Instead of API Credits

Final Performance Summary (All Approaches)