Phase 1 (one-shot-factory-go-plan.md) implements a fixed pipeline: intake → bootstrap → implement → test → iterate. This works for the first pass but hard-codes the loop structure and the relationship between implementation and testing.
Phase 2 moves to an issue-driven loop aligned with the target architecture in architecture.md. The orchestrator becomes a thin scheduler that picks the next issue and delegates implementation work to the agent. The orchestrator owns validation (parse, lint, evaluate, instantiate, run tests) after every agent turn, feeding failures back so the agent can self-correct.
The factory loop iterates over issues in the project, one at a time. Each issue describes a unit of work the agent should complete. The orchestrator's only job is:
- Select the next unblocked issue (based on ordering / dependency rules)
- Hand it to the agent
- Wait for the agent to exit
- Run validation (parse, lint, evaluate, instantiate, run tests) — feed failures back as context
- Read updated issue state and repeat (inner loop) or advance to next issue (outer loop)
The agent always exits the same way — the orchestrator reads the issue's updated status/tags to decide what happened. If the agent tagged the issue as blocked (e.g., needs human clarification), the orchestrator skips it and moves on. If the issue is marked done, the orchestrator advances. This keeps the agent's exit path uniform — it doesn't need a separate "blocked" signal in its return type, it just updates the issue and exits.
This makes the loop generic. It doesn't need to know whether an issue is "implement a card", "write tests", "create the project spec", or "break down the brief into tickets". The agent reads the issue, does the work, updates the issue status, and exits.
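The decision logic after each agent exit is small enough to state directly. A minimal sketch (names and types are illustrative, not the factory's actual code):

```typescript
// Hypothetical sketch: the orchestrator's decision after an agent turn is a
// pure function of the issue's updated status. There is no separate "blocked"
// channel in the agent's return type.
type IssueStatus = 'backlog' | 'in_progress' | 'done' | 'blocked';

type NextAction = 'advance' | 'skip' | 'iterate';

function decideAfterTurn(status: IssueStatus): NextAction {
  if (status === 'done') return 'advance'; // outer loop picks the next issue
  if (status === 'blocked') return 'skip'; // needs human input; move on
  return 'iterate'; // keep working the same issue in the inner loop
}
```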
Issues need properties that let the orchestrator determine execution order. Possible fields (may use a combination):
- priority — enum (`critical`, `high`, `medium`, `low`); `critical` = execute first
- predecessors / blockedBy — explicit dependency edges; an issue cannot start until its blockers are done
- order — explicit sequence number for tie-breaking
The selection algorithm (implemented in IssueScheduler.pickNextIssue()):
- Filter to issues with status `backlog` or `in_progress`
- Exclude issues whose `blockedBy` list contains any non-completed issue
- Exclude exhausted issues (hit `maxIterationsPerIssue` in the current run)
- Sort: `in_progress` first, then by priority (`critical` > `high` > `medium` > `low`), then by order (ascending)
- Pick the first one
Resume semantics: if an issue is already in_progress, it takes priority over backlog issues (the factory was interrupted and should continue where it left off).
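The selection rules can be sketched as a pure function over the issue list (a simplified model; the actual `IssueScheduler.pickNextIssue()` may differ in detail):

```typescript
// Simplified model of the selection algorithm described above.
// Field names (status, priority, order, blockedBy) follow the plan; the
// surrounding types are illustrative, not the factory's actual interfaces.
type Priority = 'critical' | 'high' | 'medium' | 'low';

interface Issue {
  id: string;
  status: 'backlog' | 'in_progress' | 'done' | 'blocked';
  priority: Priority;
  order: number;
  blockedBy: string[]; // ids of blocker issues
}

const PRIORITY_RANK: Record<Priority, number> = { critical: 0, high: 1, medium: 2, low: 3 };

function pickNextIssue(issues: Issue[], exhausted: Set<string>): Issue | undefined {
  const done = new Set(issues.filter((i) => i.status === 'done').map((i) => i.id));
  return issues
    .filter((i) => i.status === 'backlog' || i.status === 'in_progress')
    .filter((i) => i.blockedBy.every((id) => done.has(id))) // no live blockers
    .filter((i) => !exhausted.has(i.id)) // skip issues that hit maxIterationsPerIssue
    .sort(
      (a, b) =>
        // in_progress first (resume semantics), then priority, then order
        Number(b.status === 'in_progress') - Number(a.status === 'in_progress') ||
        PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority] ||
        a.order - b.order,
    )[0];
}
```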
The loop has two levels:
- Outer loop — iterates over all unblocked, unfinished issues (picks the next one when the current one is done or blocked)
- Inner loop — iterates on a single issue until the agent marks it done, blocked, or max iterations are reached
Every inner-loop iteration (agent turn) is followed by a validation phase owned by the orchestrator. An issue may require multiple iterations before it's done — validation runs after each one. This is similar to how Phase 1 runs tests after the agent signals done, but expanded to a full automated evaluation pipeline. The agent does not need to create separate "run tests" issues — validation is baked into the inner loop.
After each agent turn in the inner loop, the orchestrator runs these checks deterministically (as described in architecture.md):
- Parse — Verify that all modified `.gts` and `.json` files are syntactically valid
- Lint — Run lint checks on modified files
- Module evaluation — Ensure card modules load and evaluate without errors (import resolution, no runtime crashes)
- Card instantiation — Verify that sample card instances can be instantiated from their definitions
- Run existing tests — Execute all QUnit `.test.gts` files in the target realm via the QUnit test page
The validation pipeline is implemented as a modular system in src/validators/:
ValidationStepRunner interface — the contract every step must implement:

```typescript
interface ValidationStepRunner {
  readonly step: ValidationStep;
  run(targetRealmUrl: string): Promise<ValidationStepResult>;
  formatForContext(result: ValidationStepResult): string;
}
```

ValidationPipeline class — implements the Validator interface and composes step runners:

- Steps run concurrently via `Promise.allSettled()` — a failure or exception in one step does not prevent others from running
- Exceptions thrown by a step are captured as failed `ValidationStepResult` entries with the error message
- `formatForContext()` delegates to each step runner to produce LLM-friendly markdown
- `createDefaultPipeline(config)` factory function composes all 5 steps with config injection
Step-specific failure shapes — each validation type carries its own structured data in ValidationStepResult.details (flattened POJOs, not cards):
- Test step: `{ testRunId, passedCount, failedCount, failures: [{ testName, module, message, stackTrace }] }` — reads back the completed TestRun card from the realm for detailed failure data (will become cheap local filesystem reads after boxel-cli integration)
- Future parse/lint/evaluate/instantiate steps: each defines its own `details` shape
Adding a new validation step = creating a new module file in src/validators/ + replacing the NoOpStepRunner in createDefaultPipeline().
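For illustration, a new step might look like the following sketch. The `ValidationStep` and `ValidationStepResult` shapes here are assumed stand-ins for the real types in src/validators/, and `LintStepRunner`'s body is hypothetical:

```typescript
// Assumed local stand-ins for the real types in src/validators/.
type ValidationStep = 'parse' | 'lint' | 'evaluate' | 'instantiate' | 'test';

interface ValidationStepResult {
  step: ValidationStep;
  passed: boolean;
  errors: string[];
  details?: unknown; // step-specific structured data
}

interface ValidationStepRunner {
  readonly step: ValidationStep;
  run(targetRealmUrl: string): Promise<ValidationStepResult>;
  formatForContext(result: ValidationStepResult): string;
}

// Hypothetical step implementation. Note the bootstrap-friendly behavior:
// when there is nothing to validate, the step passes with an empty error list.
class LintStepRunner implements ValidationStepRunner {
  readonly step = 'lint' as const;

  async run(targetRealmUrl: string): Promise<ValidationStepResult> {
    const errors: string[] = []; // a real runner would lint modified files here
    return { step: this.step, passed: errors.length === 0, errors };
  }

  formatForContext(result: ValidationStepResult): string {
    return result.passed
      ? '## Lint: passed'
      : `## Lint: failed\n${result.errors.map((e) => `- ${e}`).join('\n')}`;
  }
}
```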
Validation failures are fed back to the agent as context in the next inner-loop iteration. The orchestrator does not create fix issues for validation failures — it iterates with the failure details so the agent can self-correct. This mirrors Phase 1's approach (feed test results back, iterate) but with a broader validation pipeline.
The inner loop continues until:
- The agent marks the issue as done (all validation passes)
- The agent marks the issue as blocked (needs human input)
- Max iterations are reached with failing validation — the orchestrator blocks the issue with the reason ("max iteration limit reached") and the formatted validation failure context in the issue description, then moves to the next issue
- Max iterations are reached with passing validation — the issue is exhausted but not blocked (agent did not mark done despite passing validation)
The agent always has the option to create new issues via tool calls if it determines that a failure requires separate work (e.g., "this card definition depends on another card that doesn't exist yet — creating a new issue for it"). But the orchestrator does not force this — the agent decides.
During task breakdown, the agent creates issues for implementation work:
- "Implement StickyNote card definition" (type: implement)
- "Create sample StickyNote instances" (type: implement)
- "Write QUnit tests for StickyNote" (type: implement)
- "Create Catalog Spec for StickyNote" (type: implement)
The agent does not need to create "run tests" issues. Test execution happens automatically as part of the validation phase after every inner-loop iteration.
Bootstrap issues (the seed issue that creates Project, KnowledgeArticles, and implementation issues) produce no testable code artifacts — only JSON card instances. Validation still runs after every inner-loop iteration, but each step gracefully handles "nothing to validate":
| Step | Bootstrap behavior |
|---|---|
| Parse | Checks created .json files are valid — useful |
| Lint | No-op for JSON card instances — pass |
| Module evaluation | No .gts modules created — no-op, pass |
| Card instantiation | Verifies Project/KnowledgeArticle/Issue instances are valid — useful |
| Run tests | No test files exist yet — vacuous pass |
Design principle: No special-casing per issue type. Each validation step returns passed: true with an empty errors array when there is nothing to validate. "Nothing to validate" is a pass, not an error.
The inner loop exit for bootstrap follows the same mechanism as any other issue: the agent marks the seed issue as done via tool calls, refreshIssueState() reads the updated status, and the inner loop condition (issue.status !== 'done') exits. The outer loop then calls pickNextIssue() and finds the newly created implementation issues.
Phase 1 calls this "testing" — the orchestrator runs tests after the agent signals done, feeds failures back, and iterates. Phase 2 generalizes this to a full validation pipeline (parse + lint + evaluate + instantiate + test) and feeds all failures back in the same way. The key evolution is that validation is broader (not just tests) and runs after every agent turn (not just when the agent signals done). The validation is still orchestrator-owned and deterministic — the agent never decides whether to run validation.
In phase 1, bootstrap (creating the Project, KnowledgeArticles, and initial Tickets) is a separate orchestrator phase that runs before the loop. In phase 2, bootstrap is itself driven by issues.
The flow becomes:
- Factory starts with a brief URL and a target realm
- The orchestrator creates a single seed issue: "Process brief and create project artifacts"
- The agent picks up this seed issue, reads the brief, and creates:
- The Project card
- KnowledgeArticle cards
- The initial set of implementation issues (card definitions, instances, specs, tests)
- The agent marks the seed issue as done
- The orchestrator now has a populated issue backlog and continues the normal loop
This is the "quirk" where an issue's job is to create the project itself. But it's a natural fit — the LLM participates in brief processing and task breakdown as part of the loop, not as a separate hard-coded phase. This was already identified as a goal (the plan mentions LLM participation in brief processing / artifact creation).
- The LLM can ask clarifying questions during bootstrap (by tagging the seed issue as blocked)
- Task breakdown quality improves because the LLM sees the full brief context and can make judgment calls
- The bootstrap process is testable with the same MockFactoryAgent pattern used for implementation issues
- Resume works naturally — if the factory crashes during bootstrap, the seed issue is still `in_progress` and gets picked up on restart
The phase 2 orchestrator is a thin scheduler with a built-in validation phase that runs after every agent turn:
```typescript
// As implemented in runIssueLoop() — src/issue-loop.ts
let exhaustedIssues = new Set<string>();
let outerCycles = 0;
while (
  scheduler.hasUnblockedIssues(exhaustedIssues) &&
  outerCycles < maxOuterCycles
) {
  outerCycles++;
  let issue = scheduler.pickNextIssue(exhaustedIssues);

  // Inner loop: multiple iterations per issue
  let validationResults = undefined;
  let exitReason = 'max_iterations';
  for (let iteration = 1; iteration <= maxIterationsPerIssue; iteration++) {
    let context = await contextBuilder.buildForIssue({
      issue,
      targetRealmUrl,
      validationResults,
      briefUrl,
    });
    let result = await agent.run(context, tools);

    // Validation phase — runs after EVERY agent turn
    validationResults = await validator.validate(targetRealmUrl);

    // Read issue state from realm (not from AgentRunResult.status)
    issue = await scheduler.refreshIssueState(issue);
    if (issue.status === 'done' || issue.status === 'blocked') {
      exitReason = issue.status;
      break;
    }
  }

  if (exitReason === 'max_iterations') {
    // If validation is still failing at max iterations, block the issue
    // with the reason and failure context written to the realm
    if (validationResults && !validationResults.passed) {
      exitReason = 'blocked';
      await issueStore.updateIssue(issue.id, {
        status: 'blocked',
        description: buildMaxIterationBlockedDescription(validationResults),
      });
    }
    exhaustedIssues.add(issue.id);
  }

  // Reload to pick up new issues the agent may have created
  await scheduler.loadIssues();
}
```

The agent signals progress by updating the issue — tagging it as blocked, marking it done, or leaving it in progress for another iteration. The orchestrator reads issue state from the realm after each agent turn, then runs validation. Validation failures are fed back as context in the next inner-loop iteration so the agent can self-correct. The agent can also create new issues via tool calls if it determines a failure requires separate work.
The orchestrator also writes to the realm in one case: when max iterations are reached with failing validation, it updates the issue's status to blocked and writes the formatted validation failure context into the issue description. This uses IssueStore.updateIssue(), which performs a read-modify-write against the realm card.
All domain logic (what to implement, when to create sub-issues, when to tag as blocked) lives in the agent's prompt and skills. The orchestrator owns only: issue selection, agent invocation, validation, and max-iteration blocking.
RealmIssueStore loads issues from the target realm using searchRealm() from realm-operations.ts. The search filter uses the absolute darkfactory module URL (from inferDarkfactoryModuleUrl(targetRealmUrl)), which varies by environment (production, staging, localhost). The store maps JSON:API card responses to SchedulableIssue objects.
Boxel encodes linksToMany relationships with dotted keys rather than JSON:API data arrays:
```json
{
  "relationships": {
    "blockedBy.0": { "links": { "self": "../Issues/issue-a" } },
    "blockedBy.1": { "links": { "self": "../Issues/issue-b" } }
  }
}
```

The `extractLinksToManyIds()` helper parses this format to extract blocker IDs for dependency resolution.
When searchRealm() fails (auth, network, query errors), the store logs at warn level and returns an empty list — preventing the loop from silently treating a failure as "no issues exist."
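Given the dotted-key encoding above, a plausible shape for the helper (the actual `extractLinksToManyIds()` implementation may differ):

```typescript
// Parses Boxel's dotted linksToMany encoding ("blockedBy.0", "blockedBy.1", ...)
// into an ordered list of related card ids. The relationship shape is taken
// from the JSON example above; the function signature is an assumption.
interface RelationshipEntry {
  links: { self: string };
}

function extractLinksToManyIds(
  relationships: Record<string, RelationshipEntry>,
  fieldName: string,
): string[] {
  return Object.entries(relationships)
    .filter(([key]) => key === fieldName || key.startsWith(`${fieldName}.`))
    .sort(([a], [b]) => {
      // Sort by the numeric index after the dot so blocker order is preserved
      const ia = Number(a.split('.')[1] ?? 0);
      const ib = Number(b.split('.')[1] ?? 0);
      return ia - ib;
    })
    .map(([, entry]) => entry.links.self);
}
```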
The loop distinguishes several terminal states:
| Condition | Outcome |
|---|---|
| No issues loaded | all_issues_done |
| Issues exist but all blocked at startup | no_unblocked_issues |
| All issues completed successfully | all_issues_done |
| Some issues done, others blocked or exhausted | no_unblocked_issues |
| Safety guard hit | max_outer_cycles |
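The mapping can be sketched as follows (a simplified model; the real loop derives these outcomes from its scheduler state):

```typescript
// Simplified mapping from end-of-loop state to the terminal outcomes above.
// Types and function name are illustrative, not the factory's actual code.
type LoopOutcome = 'all_issues_done' | 'no_unblocked_issues' | 'max_outer_cycles';

interface IssueLike {
  status: 'backlog' | 'in_progress' | 'done' | 'blocked';
}

function resolveOutcome(
  issues: IssueLike[],
  outerCycles: number,
  maxOuterCycles: number,
): LoopOutcome {
  if (outerCycles >= maxOuterCycles) return 'max_outer_cycles'; // safety guard hit
  // "No issues loaded" and "all issues done" collapse to the same outcome
  if (issues.every((i) => i.status === 'done')) return 'all_issues_done';
  // Anything remaining is blocked or exhausted
  return 'no_unblocked_issues';
}
```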
Phase 1 defined Project and Ticket card types in darkfactory.gts with aspirational fields that were never used. Phase 2 trims these to only the fields that are actually set or read in code, and renames Ticket → Issue to match the issue-driven loop language.
Keep (actively set or read in bootstrap, prompts, skill loader, or tool builder):
| Field | Type | Used By |
|---|---|---|
| `projectCode` | String | Bootstrap, tests, templates |
| `projectName` | String | Bootstrap, prompts, templates |
| `projectStatus` | ProjectStatusField enum | Bootstrap (set to 'active'), templates |
| `objective` | TextAreaField | Bootstrap (from brief summary), prompts |
| `scope` | MarkdownField | Bootstrap (from brief sections), tests |
| `technicalContext` | MarkdownField | Bootstrap, templates |
| `issues` | linksToMany(Issue) with query | Auto-queried, templates (renamed from tickets) |
| `knowledgeBase` | linksToMany(KnowledgeArticle) | Bootstrap, skill loader |
| `successCriteria` | MarkdownField | Bootstrap, prompts |
| `testArtifactsRealmUrl` | StringField | Tool builder (test execution) |
Drop (defined but never set or read by factory code):
| Field | Why Drop |
|---|---|
| `deadline` | Never set or read |
| `teamAgents` | Only in demo fixtures — never read by factory logic |
| `risks` | Never set or read |
| `createdAt` | Never set or read on Project (Tickets do use it) |
Rename Ticket to Issue throughout. Field renames: ticketId → issueId, ticketType → issueType.
Keep (actively set or read):
| Field | Type | Used By |
|---|---|---|
| `issueId` | String | Bootstrap, tests, templates (was ticketId) |
| `summary` | String | Bootstrap, prompts, templates |
| `description` | MarkdownField | Bootstrap, templates |
| `issueType` | IssueTypeField enum | Bootstrap (set to 'feature'), tests (was ticketType) |
| `status` | IssueStatusField enum | Bootstrap, factory-implement.ts (updated post-completion), prompts |
| `priority` | IssuePriorityField enum | Bootstrap, prompts, templates |
| `project` | linksTo(Project) | Bootstrap, skill loader |
| `assignedAgent` | linksTo(AgentProfile) | pick-ticket.ts (assignment workflow) |
| `relatedKnowledge` | linksToMany(KnowledgeArticle) | Skill loader (filters skills by knowledge tags) |
| `acceptanceCriteria` | MarkdownField | Bootstrap, prompts |
| `createdAt` | DateTimeField | Bootstrap (set to context.now) |
| `updatedAt` | DateTimeField | Bootstrap (set to context.now) |
Drop (defined but never set or read):
| Field | Why Drop |
|---|---|
| `relatedTickets` | Never set or read (Phase 2 uses blockedBy/predecessors for dependencies instead) |
| `agentNotes` | Never set or read |
| `estimatedHours` | Never set or read |
| `actualHours` | Never set or read |
The issue-driven loop needs dependency tracking fields not in Phase 1:
| Field | Type | Purpose |
|---|---|---|
| `blockedBy` | linksToMany(Issue) | Explicit dependency edges — issue can't start until blockers are done |
| `order` | NumberField | Sequence number for tie-breaking when priorities are equal |
These were described in the "Issue Ordering and Dependencies" section above but need to be added to the Issue card definition.
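In darkfactory.gts, the additions might look like the following sketch. The import paths follow the Boxel base realm convention, but the exact placement and the existing field list are assumptions; this is not the actual definition:

```gts
// Sketch only — field additions for the Issue card, not the full definition.
import NumberField from 'https://cardstack.com/base/number';
import {
  CardDef,
  field,
  contains,
  linksToMany,
} from 'https://cardstack.com/base/card-api';

export class Issue extends CardDef {
  // ...existing fields (issueId, summary, status, priority, ...) elided...
  @field blockedBy = linksToMany(() => Issue); // explicit dependency edges
  @field order = contains(NumberField); // tie-breaker within equal priority
}
```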
The darkfactory Project and Issue definitions are a stopgap — they duplicate fields that should come from the high-quality task tracker cards in the catalog. Longer term, both should `adoptsFrom` the catalog's task tracker card types rather than maintaining their own field definitions. This means:
- Project adopts from the catalog's Project/Board card (inherits status tracking, team management, etc.)
- Issue adopts from the catalog's Task/Issue card (inherits status workflows, priority, dependencies, etc.)
- darkfactory.gts only adds factory-specific fields (e.g., `testArtifactsRealmUrl`) on top of the inherited base
This aligns with the catalog-first philosophy: the factory uses the same card types that users create in Boxel, not a parallel schema. It also means improvements to the catalog task tracker (better status workflows, richer dependency modeling) automatically flow into the factory.
CS-10671 trims and renames the current schema as a first step. The adoption from catalog task tracker cards may happen as part of Phase 2 or as a follow-on — timing TBD.
```
backlog → in_progress → done
                      → blocked (needs human input or max iterations with failing validation)
                      → review (optional)
```
The agent manages its own transitions by updating the issue directly (e.g., tagging as blocked, marking done). The orchestrator reads the issue state after the agent exits to decide what to do next — it does not inspect the agent's return value for status.
The orchestrator also transitions issues to blocked in one case: when max iterations are reached with validation still failing. It writes the reason ("max iteration limit reached") and the formatted validation failure context into the issue description via IssueStore.updateIssue(), making the blocking reason visible in the realm. Issues blocked this way are also added to an exhaustedIssues set to prevent re-selection within the same run.
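A sketch of how that blocked-issue description might be assembled (the real `buildMaxIterationBlockedDescription()` may format differently; the `ValidationResults` shape here is an assumption):

```typescript
// Assumed shape: the pipeline's pass/fail flag plus the LLM-friendly
// markdown produced by formatForContext().
interface ValidationResults {
  passed: boolean;
  formatted: string;
}

function buildMaxIterationBlockedDescription(results: ValidationResults): string {
  // Make the blocking reason and the failure context visible in the realm
  return [
    '**Blocked: max iteration limit reached**',
    '',
    'Validation was still failing when the iteration budget ran out:',
    '',
    results.formatted,
  ].join('\n');
}
```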
Phase 1 and phase 2 coexist during the transition. The implementation lives in separate files to avoid touching Phase 1 code:
- `src/issue-scheduler.ts` — `IssueScheduler`, `IssueStore`, `RealmIssueStore`
- `src/issue-loop.ts` — `runIssueLoop()`, `Validator`, `NoOpValidator`, config/result types
Phase 1's factory-loop.ts (runFactoryLoop()) remains untouched. The LoopAgent interface (run(context, tools)) is unchanged and reused by both loops. FactoryTool[] carries forward unchanged.
CS-10708 tracks the integration work: wire runIssueLoop(), the validation phase (CS-10675), and bootstrap-as-seed (CS-10673) into the factory:go entrypoint, then retire all Phase 1 mechanisms (factory-loop.ts, factory-loop.test.ts, old bootstrap orchestration, TestRunner/buildTestRunner()).
In phase 1, request_clarification is a pure control flow signal — the agent calls it, the loop returns clarification_needed, and the message appears in the JSON output. Nothing is persisted to the realm, so the clarification request is lost if the output isn't captured.
In phase 2, request_clarification should create a blocking issue in the realm that signals to the outside world that human input is needed. This makes clarification requests durable, visible in the Boxel UI, and resolvable by a human through the normal issue workflow.
When the agent calls request_clarification:
- Create a new issue in the target realm with:
  - type: `clarification`
  - status: `blocked`
  - summary: a short description of what's needed (from the agent's message)
  - description: full context — what the agent was working on, what it tried, and what specific input it needs from a human
  - blockedBy: (none — this issue IS the blocker)
  - blocks: the current issue the agent was working on (so the blocked issue can't resume until clarification is resolved)
- Update the current issue's status to `blocked` with a reference to the clarification issue
- The agent exits its turn
The orchestrator then sees the current issue is blocked and moves on (or stops if no other unblocked issues exist).
A human resolves the clarification by:
- Opening the clarification issue in Boxel
- Adding a response (e.g., updating the issue description with the answer, or adding a comment)
- Marking the clarification issue as `done`
This automatically unblocks the dependent issue. On the next orchestrator iteration, the previously-blocked issue becomes eligible for execution. The agent picks it up, sees the resolved clarification issue in context, and continues.
In phase 1, the tool stays as-is (signal-only). The phase 2 refactor replaces the tool's execute function to:
- Write the clarification issue to the realm via `writeCardSource`
- Update the current issue's `blockedBy` field
- Return the `CLARIFICATION_SIGNAL` (so the loop still exits correctly)
The LoopAgent and runFactoryLoop signatures don't change — the signal mechanism is preserved, but now it has a durable side effect.
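Sketched out, the phase 2 execute function could look like this. The `IssueStore` shape, the helper name, and the signal value are assumptions; only the three-step behavior comes from the plan above:

```typescript
// Hypothetical sketch of the phase 2 request_clarification flow.
// CLARIFICATION_SIGNAL and IssueStore stand in for the factory's real types.
const CLARIFICATION_SIGNAL = 'clarification_needed';

interface IssueStore {
  createIssue(fields: Record<string, unknown>): Promise<string>; // returns new issue id
  updateIssue(id: string, patch: Record<string, unknown>): Promise<void>;
}

async function executeRequestClarification(
  store: IssueStore,
  currentIssueId: string,
  message: string,
): Promise<string> {
  // 1. Durable clarification issue: blocked until a human marks it done
  const clarificationId = await store.createIssue({
    issueType: 'clarification',
    status: 'blocked',
    summary: message,
  });
  // 2. Block the current issue on it so it can't resume until resolved
  await store.updateIssue(currentIssueId, {
    status: 'blocked',
    blockedBy: [clarificationId],
  });
  // 3. Preserve the control-flow signal so the loop still exits correctly
  return CLARIFICATION_SIGNAL;
}
```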
The boxel-cli integration work is tracked in a dedicated Linear project: "Incorporate Boxel CLI to Monorepo". Key tickets include:
- CS-10519 — Import boxel-cli into monorepo as `packages/boxel-cli`
- CS-10520 — Factory as boxel-cli subcommands; migrate realm-operations; retire file I/O tools
- CS-10642 — boxel-cli owns full auth lifecycle (realm server tokens, per-realm tokens, auto-acquisition)
- CS-10613 — Skill alignment: deduplicate, establish consistent homes, create `boxel-api` skill
- CS-10670 — boxel-cli publishes tool definitions for factory consumption (tool delegation)
- CS-10666 — Create `boxel-api` skill (federated search, realm creation, auth model)
- CS-10667 — Create `boxel-command` skill (host commands via prerenderer)
- CS-10593 — Claude Code native LLM support (ClaudeCodeFactoryAgent)
- CS-10594 — Codex CLI native support
Any code that makes an HTTP call to the realm server or Matrix API must live in boxel-cli. The software factory never calls realm APIs directly — it imports from boxel-cli. This is not a convenience; it is a hard boundary.
This means:
- `realm-operations.ts` (20 functions wrapping realm HTTP endpoints) → migrates to boxel-cli
- Auth helpers (`realm-auth.ts`, `boxel.ts` Matrix/OpenID flows) → migrate to boxel-cli
- Skills that teach realm API concepts (search queries, federated endpoints, auth model) → live with boxel-cli
- The factory keeps only orchestration logic: the ralph loop, test execution orchestration, bootstrap flow, and issue scheduling
The factory becomes a pure consumer of boxel-cli's API layer. It calls boxel sync, boxel pull, boxel create, or imports boxel-cli's programmatic API — it never constructs HTTP requests to realm endpoints.
The realm-operations.ts module was designed as a centralized, self-contained set of realm API wrappers with no factory-specific logic. It migrates wholesale:
| Function | Endpoint | boxel-cli Home |
|---|---|---|
| `searchRealm()` | `QUERY /_search` | Evolves into federated search via `/_federated-search` |
| `readFile()` | `GET /<path>` | Absorbed by `boxel pull` / programmatic read API |
| `writeFile()` | `POST /<path>` | Absorbed by `boxel sync` / programmatic write API |
| `deleteFile()` | `DELETE /<path>` | Absorbed by `boxel sync --prefer-local` with deletions |
| `atomicOperation()` | `POST /_atomic` | Already implemented in boxel-cli's batch upload |
| `runRealmCommand()` | `POST /_run-command` | New `boxel command` subcommand (CS-10416) |
| `createRealm()` | `POST /_create-realm` | New `boxel create-realm` subcommand |
| `getServerSession()` | `POST /_server-session` | Part of boxel-cli's auth layer |
| `getRealmScopedAuth()` | `POST /_realm-auth` | Part of boxel-cli's auth layer |
| `cancelAllIndexingJobs()` | `POST /_cancel-indexing-job` | New boxel-cli API |
| `waitForRealmReady()` | `GET /_readiness-check` | New boxel-cli API |
| `waitForRealmFile()` | `GET /<path>` (polling) | New boxel-cli API |
| `pullRealmFiles()` | `GET /_mtimes` + files | Already `boxel pull` (auth managed by boxel-cli per CS-10642) |
| `addRealmToMatrixAccountData()` | Matrix account data API | Part of boxel-cli's auth/profile layer |
Auth helpers in realm-auth.ts and boxel.ts (Matrix login, OpenID token, realm server token, per-realm JWTs) also migrate to boxel-cli's auth layer.
After migration, realm-operations.ts is deleted. Direct fetch() calls to realm endpoints in factory-bootstrap.ts and factory-target-realm.ts are replaced with boxel-cli imports.
The current searchRealm() targets a single specified realm. In boxel-cli, this evolves into a federated search backed by the realm server's /_federated-search endpoint, which searches across all realms the user has access to using multiRealmAuthorization.
The initial implementation uses /_federated-search only. The realm server also exposes /_federated-search-prerendered, /_federated-types, and /_federated-info, but these are not in scope for the initial integration.
For the locally synced target realm, the LLM uses native grep/find — no API call needed. Federated search is for querying remote realms (catalog, base realm, other users' realms).
Since boxel-cli owns the Boxel API surface, skills that teach realm API concepts live with boxel-cli:
- `boxel-api` (new skill) — search query syntax, federated endpoints, realm creation, auth model. Lives at `packages/boxel-cli/.agents/skills/`
- CLI command skills (`boxel-sync`, `boxel-track`, etc.) — already CLI-specific. Live at `packages/boxel-cli/.agents/skills/`
- Card domain knowledge (`boxel-development`, `boxel-file-structure`) — not API-specific, applies to anyone working with cards. Lives at root `.agents/skills/`
- Factory orchestration (`software-factory-operations`) — ralph loop, factory tools. Lives at `packages/software-factory/.agents/skills/`
Phase 1 uses HTTP API calls (realm-operations.ts) as the primary realm I/O path. Boxel-cli exists and has profile-based auth, but its auth model isn't flexible enough for the factory's needs — specifically, obtaining auth tokens for newly created realms on the fly. Boxel-cli also lives in a separate repository (cardstack/boxel-cli), making it difficult to evolve in lockstep with factory requirements.
Phase 2 solves both problems: integrate boxel-cli into the monorepo as a first-class package, extend its auth model to handle dynamically created realms, and use it as the primary realm I/O layer.
With boxel-cli, the agent gets a local directory that mirrors a realm:
- LLMs are already fluent with filesystem tools — `cat`, `grep`, `ls`, `rm`, file writes. No custom `read_file`/`write_file`/`search_realm` wrappers needed for basic operations.
- Batch writes are trivial — write files locally, then `boxel sync . --prefer-local` to push them all at once.
- CLI skills become usable — the 6 CLI skills excluded in phase 1 become available to the factory agent.
- Test files run directly — the agent writes `.spec.ts` files to a local directory and Playwright runs them without pulling from a remote realm first.
Boxel-cli currently lives in a separate repository (cardstack/boxel-cli). Phase 2 moves it into the Boxel monorepo as packages/boxel-cli:
- Import the package into `packages/boxel-cli` with its existing source, tests, and build configuration
- Wire it into the monorepo — add it to the pnpm workspace, ensure it builds alongside other packages, integrate with CI (linting, type-checking, test suite)
- Make it a dependency of `packages/software-factory` so factory scripts can import CLI utilities directly (e.g., sync logic, auth helpers) rather than shelling out to `npx boxel`
- Preserve the standalone CLI — `npx boxel` and `npm install -g boxel-cli` must continue to work for human users
Being in the monorepo means:
- Changes to boxel-cli and the factory can land in the same PR
- The factory's CI runs against the exact boxel-cli version it depends on — no version drift
- Boxel-cli gets the same CI rigor as other packages: linting, type-checking, thorough test coverage
- Shared types and utilities can be extracted to `runtime-common` instead of being duplicated
Boxel-cli already has profile-based auth — users log in via boxel profile add, and the CLI uses stored credentials to authenticate with realm servers. But the factory creates new realms on the fly and immediately needs to read/write to them. Profile-based auth only knows about realms the user has manually configured.
The principle that boxel-cli owns the entire Boxel API surface extends to auth. The factory should never touch a JWT directly — boxel-cli manages the full token lifecycle internally:
- Two-tier token model — boxel-cli understands both realm server tokens (obtained via Matrix OpenID → `POST /_server-session`, grants server-level access) and per-realm tokens (obtained via `POST /_realm-auth`, grants access to specific realms). Both are cached and refreshed automatically.

- Automatic token acquisition on realm creation — When `boxel create-realm` creates a new realm, boxel-cli automatically waits for readiness, obtains the per-realm JWT, and stores it in its auth state. Subsequent `boxel pull` / `boxel sync` on that realm Just Work — tokens are managed internally by boxel-cli.

- Programmatic auth API — Export a `BoxelAuth` class (or similar) so the factory imports it and never constructs HTTP requests or manages tokens:

  ```typescript
  import { BoxelAuth } from '@cardstack/boxel-cli';

  const auth = new BoxelAuth(credentials);
  await auth.createRealm({ name, owner }); // token auto-acquired
  await auth.pull(realmUrl, workspaceDir); // uses stored token
  await auth.sync(workspaceDir, { preferLocal: true });
  ```

- Token refresh for long-running operations — The factory loop runs for hours. boxel-cli's `RealmAuthClient` already has token refresh with 60s lead time — this extends to cover all realm operations so long-running sessions don't fail mid-stream.
After this, the factory deletes realm-auth.ts, auth portions of boxel.ts, and all authorization/serverToken/realmTokens fields threaded through its config types.
Phase 1 creates realms by calling POST /_create-realm directly. Phase 2 moves this into boxel-cli as a first-class command. The exact CLI arguments are still being worked through, but the principle is:
- Boxel-cli already knows which realm server it's authenticated with (from the active profile). It should not require the realm server URL as a CLI argument.
- After creating a realm, boxel-cli incorporates the new realm's auth token into its auth state so subsequent commands (`boxel sync`, `boxel pull`, etc.) work immediately.
- The factory's `factory-target-realm.ts` becomes a thin wrapper that calls boxel-cli rather than making raw HTTP requests.
With boxel-cli integration, the factory's I/O model shifts from per-file HTTP calls to sync-based batch operations:
Before (phase 1, per-file HTTP):

```
Agent calls write_file({ path: "sticky-note.gts", content: "...", realm: "target" })
  → orchestrator POSTs to realm HTTP API with card+source MIME type
  → repeat for each file
```

After (phase 2, sync-based):

```
Agent writes files to local workspace directory using standard filesystem tools
  → boxel sync . --prefer-local (pushes all changes to realm in one batch)
  → or boxel track . --push (auto-pushes as files change)
```
This means:
- `write_file` and `read_file` wrapper tools are replaced by the LLM's native filesystem tools. The agent writes to `./sticky-note.gts` directly.
- `search_realm` is replaced by a combination of local `grep`/`find` for file-level searches and `boxel-search` (or the `search-realm` script tool) for structured card queries that require the realm index.
- `realm-read`, `realm-write`, `realm-delete` remain available for operations that must happen immediately on the live realm (e.g., updating a ticket status that another process is watching), but they are no longer the primary I/O path.
- `realm-atomic` remains for transactional multi-file operations where partial failure is unacceptable.
Some operations are inherently server-side and cannot be replaced by local file I/O. These remain as factory tools but are backed by boxel-cli imports — no direct HTTP calls from the factory:
- `search_realms` — federated search across all accessible realms via boxel-cli wrapping `/_federated-search`
- `run_command` — host commands via the prerenderer, backed by boxel-cli wrapping `/_run-command`
- `run_tests` — Playwright orchestration (factory-specific, uses boxel-cli for file pulls)
- `signal_done` / `request_clarification` — control flow signals back to the ralph loop (factory-only, no API call)
- `realm-create` — backed by boxel-cli's `BoxelAuth.createRealm()` with auto token acquisition (CS-10642)
Auth tools (`realm-server-session`, `realm-auth`) are fully absorbed into boxel-cli's auth layer per CS-10642 — the factory never manages tokens.
The ToolRegistry in phase 2 includes all three categories:
```typescript
// Phase 1: only SCRIPT_TOOLS + REALM_API_TOOLS
// Phase 2: all tools available
let allManifests = [...SCRIPT_TOOLS, ...BOXEL_CLI_TOOLS, ...REALM_API_TOOLS];
```

BOXEL_CLI_TOOLS (`boxel-sync`, `boxel-push`, `boxel-pull`, `boxel-status`, `boxel-create`, `boxel-history`) become available to the agent. The factory-level wrapper tools (`write_file`, `read_file`, `search_realm`) can be retired or kept as convenience aliases that delegate to the filesystem + sync.
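One way the registry could guard this composition is to reject duplicate tool names when merging the three sets; `ToolManifest` and `mergeManifests` here are stand-ins for the real registry types, not the factory's actual API:

```typescript
// Illustrative merge of several tool-manifest sets with a collision check, so
// a boxel-cli tool and a factory tool can never silently shadow each other.
interface ToolManifest {
  name: string;
}

export function mergeManifests(...sets: ToolManifest[][]): ToolManifest[] {
  const seen = new Set<string>();
  const merged: ToolManifest[] = [];
  for (const set of sets) {
    for (const manifest of set) {
      if (seen.has(manifest.name)) {
        throw new Error(`duplicate tool name: ${manifest.name}`);
      }
      seen.add(manifest.name);
      merged.push(manifest);
    }
  }
  return merged;
}
```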
The 6 CLI skills excluded in phase 1 (boxel-sync, boxel-track, boxel-watch, boxel-restore, boxel-repair, boxel-setup) are re-enabled in the skill resolver. The `CLI_ONLY_SKILLS` exclusion list in `factory-skill-loader.ts` is removed.
Beyond re-enablement, CS-10613 performs a full skill alignment:
- Deduplication — 8 of 9 factory skills are identical copies in boxel-cli. Each skill gets a single source of truth.
- Consistent homes — Skills are placed based on what they teach:
  - CLI commands + realm API → `packages/boxel-cli/.agents/skills/` (boxel-sync, boxel-track, boxel-watch, boxel-repair, boxel-restore, boxel-setup, boxel-api NEW)
  - Card domain knowledge → root `.agents/skills/` (boxel-development, boxel-file-structure)
  - Factory orchestration → `packages/software-factory/.agents/skills/` (software-factory-operations)
- New `boxel-api` skill — Consolidates scattered realm API knowledge (search queries, federated endpoints, auth model, realm creation) into a canonical reference at boxel-cli. This fills the current gap where no skill covers federated endpoints, realm creation, or auth flows.
- Skill content rewrite — All skills updated to remove references to retired HTTP tools (`write_file`, `read_file`, `search_realm`). Skills teach Boxel-specific domain knowledge only — not how to read/write files (the LLM already knows).
- Loader updates — Factory's custom skill loader updated with fallback dirs: primary (software-factory) → fallback 1 (boxel-cli) → fallback 2 (root). Both Claude Code's native loader and the factory's programmatic loader read from the same skill files via symlinks.
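The fallback lookup could look like this sketch; the directory tiers come from the plan, while `resolveSkillDir` and the injected `exists` predicate are illustrative, not the loader's real API:

```typescript
// Illustrative fallback resolution: try each tier in priority order and return
// the first directory that actually contains the skill. The filesystem check
// is injected so the lookup logic stays pure and testable.
export function resolveSkillDir(
  skillName: string,
  searchDirs: string[],
  exists: (path: string) => boolean,
): string | undefined {
  for (const dir of searchDirs) {
    const candidate = `${dir}/${skillName}`;
    if (exists(candidate)) return candidate;
  }
  return undefined; // skill not found in any tier
}

// The plan's priority order: factory first, then boxel-cli, then root.
export const tiers = [
  'packages/software-factory/.agents/skills',
  'packages/boxel-cli/.agents/skills',
  '.agents/skills',
];
```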
The refactor happens in stages to avoid a big-bang rewrite:
- Stage 1: Monorepo import — Move boxel-cli into `packages/boxel-cli`. Set up CI (linting, type-checking, tests). All existing factory code continues to use HTTP-based realm operations unchanged.
- Stage 2: Auth extension (CS-10642) — Extend boxel-cli auth to automatically acquire and store tokens for newly created realms. Add a programmatic auth API. Factory tests verify that `boxel create` followed by `boxel sync` works seamlessly for factory-created realms.
- Stage 3: Sync-based workspace — The factory entrypoint syncs the target realm to a local workspace before starting the agent loop. The agent writes files locally. A post-iteration sync pushes changes to the realm.
- Stage 4: Retire HTTP wrappers — Remove the `realm-operations.ts` stopgap functions (`writeModuleSource`, `readCardSource`, `writeCardSource`, `pullRealmFiles`). Replace with boxel-cli calls. Keep `searchRealm` for structured queries.
- Stage 5: Re-enable CLI skills — Remove the `CLI_ONLY_SKILLS` filter from the skill resolver. Update CLI skill content for the factory agent context.
`factory-tool-builder.ts` currently hardcodes every tool's name, description, JSON Schema parameters, and execute function (~14 tool definitions). When tools migrate to boxel-cli, the factory shouldn't have to maintain definitions for tools it doesn't own — that creates a coupling problem where parameter changes in boxel-cli require matching updates in the factory.
The fix: boxel-cli publishes its own tool surface and the factory consumes it via delegation.
boxel-cli exports a function that returns tool definitions:
```typescript
// In @cardstack/boxel-cli
export function getToolDefinitions(auth: BoxelAuth): BoxelToolDefinition[] {
  return [
    {
      name: 'search_realms',
      description: 'Federated search across all accessible realms',
      parameters: {
        /* JSON Schema */
      },
      execute: async (params) => auth.federatedSearch(params.query),
    },
    {
      name: 'run_command',
      description: 'Execute a host command via the prerenderer',
      parameters: {
        /* JSON Schema */
      },
      execute: async (params) => auth.runCommand(params.command, params.input),
    },
    // ... all boxel-cli tools, each with schema + implementation
  ];
}
```

The factory tool builder becomes a thin composition layer:
```typescript
// In software-factory
import { getToolDefinitions } from '@cardstack/boxel-cli';

function buildTools(auth: BoxelAuth): FactoryTool[] {
  const cliTools = getToolDefinitions(auth); // delegated — boxel-cli owns these
  const factoryTools = [
    { name: 'signal_done', ... },           // factory-only
    { name: 'request_clarification', ... }, // factory-only
    { name: 'run_tests', ... },             // factory-specific Playwright orchestration
  ];
  return [...cliTools, ...factoryTools];
}
```

This means:
- Single source of truth — boxel-cli owns the name, description, schema, and implementation for its tools
- Factory tool builder shrinks — from ~14 manually defined tools to 3-4 factory-specific ones
- No coupling — adding or changing a boxel-cli tool automatically reflects in the factory with zero factory code changes
- Skill alignment — the `boxel-api` skill (CS-10666) and tool definitions are co-located in boxel-cli, so they stay in sync
A natural evolution is for boxel-cli to expose its tools as an MCP (Model Context Protocol) server. This would allow Claude Code, Codex CLI, or any MCP-compatible agent to discover and call boxel-cli tools directly — without the factory as intermediary.
In this model:
- boxel-cli runs `boxel mcp-server` (or is configured as an MCP server in `.claude/settings.json`)
- Claude Code connects and discovers all available tools: `search_realms`, `create_realm`, `run_command`, `sync`, `pull`, `push`, etc.
- The ralph loop can also connect as an MCP client when invoking the agent, so the agent gets boxel-cli tools alongside factory tools
- Tool definitions, schemas, and descriptions are served dynamically — always up to date
This ties into CS-10418 (realms exposing MCP servers) and creates a consistent tool discovery pattern across the Boxel ecosystem. The programmatic manifest (Option A above) is the right first step because it's simpler and works today. MCP is the path once the protocol stabilizes and tool discovery becomes the standard for agent runtimes.
With tool delegation, the factory only manually defines tools it uniquely owns:
| Tool | Owner | How it's defined |
|---|---|---|
| `search_realms` | boxel-cli | Delegated via `getToolDefinitions()` |
| `run_command` | boxel-cli | Delegated via `getToolDefinitions()` |
| `create_realm` | boxel-cli | Delegated via `getToolDefinitions()` |
| `run_tests` | software-factory | Manual — factory-specific Playwright orchestration |
| `signal_done` | software-factory | Manual — control flow signal to ralph loop |
| `request_clarification` | software-factory | Manual — control flow signal to ralph loop |
All retired tools (`write_file`, `read_file`, `search_realm`, `update_ticket`, `update_project`, `create_knowledge`, `create_catalog_spec`, `realm-read`, `realm-write`, `realm-delete`) are gone — replaced by native LLM file I/O + `boxel sync`.
- Issue creation during execution: Can the agent create new issues mid-loop (e.g., "I found a bug, creating a fix issue")? This is powerful but needs guardrails to prevent issue explosion.
- Parallel execution: Can multiple non-dependent issues execute in parallel? Phase 2 starts serial, but the issue graph naturally supports parallelism.
- Max iterations per issue: Should this be a property on the issue, or a global default? Some issues (test execution) may need more retries than others.
- Issue type taxonomy: What's the minimal set of issue types? Candidates: `implement`, `test-write`, `test-execute`, `bootstrap`, `knowledge`, `review`.
- Failure escalation: When an issue fails after max retries, should it block dependents automatically, or should the agent decide?
- Workspace lifecycle: When the factory creates a new target realm and syncs it locally, where does the local workspace live? Options: a temp directory (cleaned up on exit), a stable path under `.claude/worktrees/`, or a user-specified path.
- Concurrent realm writes: If the agent writes files locally while `boxel track --push` is running, how do we prevent partial pushes? Options: write-then-sync (no track), edit locks, or batched sync after the agent exits.
The sequence diagram in architecture.md shows the target state:
```
loop Until no unblocked issues left (or max iterations reached)
    Factory->>ClaudeCodeCLI: Invoke With Prompt
    ClaudeCodeCLI->>HostedBoxel: Work issue and update issue status when done
```
Phase 2 implements exactly this. The "Factory" is the thin orchestrator/scheduler. "ClaudeCodeCLI" is the `LoopAgent`. "HostedBoxel" is the realm accessed via `FactoryTool[]` and local workspace sync. The issue selection, dependency resolution, and status updates are the orchestrator's only responsibilities.
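The scheduler loop described above can be sketched as follows; `Issue`, `invokeAgent`, and `validate` are stubs standing in for the factory's real types, and the reopen-on-failure policy is illustrative:

```typescript
// Minimal sketch of the thin orchestrator loop: pick the next unblocked issue,
// hand it to the agent, validate after the turn, and repeat. Stub types only.
export interface Issue {
  id: string;
  status: 'open' | 'blocked' | 'done';
}

export async function runFactoryLoop(
  issues: Issue[],
  invokeAgent: (issue: Issue) => Promise<void>,
  validate: () => Promise<string[]>, // failure messages; empty means clean
  maxIterations = 50,
): Promise<void> {
  for (let turn = 0; turn < maxIterations; turn++) {
    // Select the next unblocked issue; stop when none remain.
    const next = issues.find((issue) => issue.status === 'open');
    if (!next) return;

    // Hand the issue to the agent; the agent updates its status before exiting.
    await invokeAgent(next);

    // Orchestrator-owned validation after every agent turn.
    const failures = await validate();
    if (failures.length > 0 && next.status === 'done') {
      next.status = 'open'; // reopen so the agent can self-correct next turn
    }
  }
}
```

Note the uniform exit path: the orchestrator never receives a "blocked" return value; it only reads the issue state the agent left behind.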
With boxel-cli integration, "HostedBoxel" is accessed through a synced local workspace rather than direct HTTP calls — the agent works on local files and boxel-cli handles the synchronization. This matches how human developers use Boxel: edit locally, sync to server.