**Problem:** When the executor launches `claude -p`, it logs "output will stream below" but no output actually streams to the terminal — all output is silently redirected to the log file via `>> "${LOG_FILE}" 2>&1`.

**Files Analyzed:**

- `scripts/night-watch-helpers.sh` — `log()` function (writes ONLY to file)
- `scripts/night-watch-cron.sh` — provider dispatch (lines 545-577, 624-637)
- `scripts/night-watch-audit-cron.sh` — provider dispatch (lines 163-189)
- `packages/core/src/utils/shell.ts` — `executeScriptWithOutput()` (already streams child stdout/stderr to terminal)

**Current Behavior:**

- `log()` writes ONLY to `LOG_FILE` (`echo ... >> "${log_file}"`) — not to stdout or stderr
- Provider commands redirect ALL output to file: `claude -p ... >> "${LOG_FILE}" 2>&1`
- Node's `executeScriptWithOutput()` listens on the bash child's stdout/stderr pipes but receives nothing because the bash script sends everything to the file
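The failure mode is easy to reproduce outside the scripts; a minimal sketch (the file name and message are placeholders):

```bash
# Reproduce the symptom: `>> file 2>&1` sends everything to the file,
# so the parent process's stdout/stderr pipes receive nothing.
log_file=$(mktemp)
echo "provider output" >> "${log_file}" 2>&1   # nothing reaches the terminal
cat "${log_file}"                              # the output landed only in the file
rm -f "${log_file}"
```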
## 2. Solution

**Approach:**

- Replace `>> "${LOG_FILE}" 2>&1` with `2>&1 | tee -a "${LOG_FILE}"` for provider dispatch — output goes to both the log file AND stdout (which propagates through Node's pipe to the terminal)
- Modify `log()` to also write to stderr so diagnostic messages are visible in the terminal during interactive `night-watch run`
- All scripts already use `set -euo pipefail`, so pipe exit codes propagate correctly (if `claude` fails with code 1 and `tee` succeeds with 0, pipefail returns 1)
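The `tee` approach can be sanity-checked in isolation — a sketch, not the actual provider dispatch:

```bash
set -euo pipefail
log_file=$(mktemp)
# Output goes to stdout (visible in the terminal) AND is appended to the log file
echo "streamed line" 2>&1 | tee -a "${log_file}"
grep -q "streamed line" "${log_file}" && echo "also captured in log"
rm -f "${log_file}"
```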

**Key Decisions:**

- `tee -a` (append mode) preserves the existing log file behavior
- Provider output goes to stdout via tee; diagnostic messages go to stderr via log — keeps them on separate channels
- No changes to `executeScriptWithOutput()` needed — it already streams both pipes to the terminal
### Phase 1: Fix `log()` to also write to stderr + use `tee` for provider output

**Files (5):**

- `scripts/night-watch-helpers.sh` — make `log()` also write to stderr
- `scripts/night-watch-cron.sh` — replace `>> "${LOG_FILE}" 2>&1` with `2>&1 | tee -a "${LOG_FILE}"` (3 occurrences: main dispatch, codex dispatch, fallback)
- `scripts/night-watch-audit-cron.sh` — same replacement (2 occurrences)
**Implementation:**

- [ ] In `night-watch-helpers.sh`, modify `log()` to also echo to stderr:

```bash
log() {
local log_file="${LOG_FILE:?LOG_FILE not set}"
  local msg="$*"
  echo "${msg}" >> "${log_file}"
  # also mirror to stderr so messages appear in the terminal
  # (exact body partially elided in the diff view; shape assumed from the plan above)
  echo "${msg}" >&2
}
```

**Pattern for each replacement:**

Before:

```bash
if (
cd "${WORKTREE_DIR}" && timeout "${SESSION_MAX_RUNTIME}" \
    claude -p … >> "${LOG_FILE}" 2>&1
); then
```

After:

```bash
if (
cd "${WORKTREE_DIR}" && timeout "${SESSION_MAX_RUNTIME}" \
    claude -p … 2>&1 | tee -a "${LOG_FILE}"
); then
```

**Exit code behavior with `pipefail`:**

- All scripts use `set -euo pipefail` (line 2)
- If `timeout ... claude` exits 124 (timeout) and `tee` exits 0 → pipe returns 124 ✓
- If `timeout ... claude` exits 1 (failure) and `tee` exits 0 → pipe returns 1 ✓
- If `timeout ... claude` exits 0 (success) and `tee` exits 0 → pipe returns 0 ✓
- The `if (...); then` construct disables `set -e` for the condition, so non-zero exits are captured correctly
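These exit-code claims can be verified with a toy pipeline, where `(exit 3)` stands in for a failing `claude` invocation:

```bash
set -uo pipefail   # pipefail on; -e off so we can inspect the status
tmp_log=$(mktemp)
if (exit 3) 2>&1 | tee -a "${tmp_log}"; then
  echo "pipe returned 0"
else
  echo "pipe returned $?"   # pipefail surfaces the left command's status: 3
fi
rm -f "${tmp_log}"
```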

**Rate-limit detection still works:**

- `check_rate_limited` greps the LOG_FILE — `tee -a` still writes everything to the file, so this is unchanged

**Tests Required:**
| Type | Test | Expected |
|------|------|----------|
| Manual | Smoke test: `bash -n scripts/night-watch-pr-reviewer-cron.sh` | No syntax errors |

**User Verification:**

- Action: Run `night-watch run` (or trigger executor)
- Expected: Diagnostic log messages AND claude's streaming output visible in the terminal in real time


## Files to Modify

| File | Change |
| ----------------------------------------- | -------------------------------------------------- |
| `scripts/night-watch-helpers.sh` | `log()` also writes to stderr |
| `scripts/night-watch-cron.sh` | 3× replace `>> LOG 2>&1` with `2>&1 \| tee -a LOG` |
| `scripts/night-watch-audit-cron.sh` | 2× same replacement |
| `scripts/night-watch-qa-cron.sh` | 2× same replacement |
| `scripts/night-watch-pr-reviewer-cron.sh` | 2× same replacement |
---
PRD execution is failing consistently across projects (night-watch-cli, autopilo…).

When the proxy returns 429, the system correctly triggers a native Claude fallback. **However, if native Claude is also rate-limited**, the fallback exits with code 1 and the system records `provider_exit` instead of `rate_limited`.

**Evidence from `logs/executor.log`:**

```
API Error: 429 {"error":{"code":"1308","message":"Usage limit reached for 5 hour..."}}
RATE-LIMITED: Proxy quota exhausted — triggering native Claude fallback
FAIL: Night watch exited with code 1 while processing 69-ux-revamp...
```

**Impact:** The system records a `failure` with `reason=provider_exit` instead of `rate_limited`, which:

- Triggers a long cooldown (max_runtime-based) instead of a rate-limit-appropriate retry
- Sends misleading failure notifications
- Prevents the PRD from being retried once the rate limit resets
The `latest_failure_detail()` function scans `tail -50` of the shared `executor.log` file, but log entries from **previous runs** can bleed into the current run's error detail.

**Evidence:** Issue #70's failure detail contains issue #69's error message:

```
detail=[2026-03-07 00:40:59] [PID:75449] FAIL: Night watch exited with code 1 while processing 69-ux-revamp...
```
This happens because the log is append-only and `latest_failure_detail()` doesn't filter to lines written during the current run.

In filesystem mode, `code-cleanup-q1-2026.md` was selected and executed despite the work already being merged to master. Claude correctly identified the work was done but didn't create a PR. The cron script then recorded `failure_no_pr_after_success`.

**Evidence:**

```
OUTCOME: exit_code=0 total_elapsed=363s prd=code-cleanup-q1-2026.md
WARN: claude exited 0 but no open/merged PR found on night-watch/code-cleanup-q1-2026
```

This is a pre-existing filesystem mode issue (stale PRDs not moved to `done/`).

After the native Claude fallback runs (line ~626), check if the fallback also hit a rate limit before falling through to the generic failure handler.

**Implementation:**

1. After `RATE_LIMIT_FALLBACK_TRIGGERED` block (lines 603-632), if `EXIT_CODE != 0`, scan fallback output for rate-limit indicators (`"hit your limit"`, `429`, `"Usage limit"`)
2. If detected, set a new flag `DOUBLE_RATE_LIMITED=1`
3. In the outcome handler (lines 711-726), when `DOUBLE_RATE_LIMITED=1`:
**Specific changes in `night-watch-cron.sh`:**

After line 632 (`fi` closing the fallback block), add:

```bash
# Detect double rate-limit: both proxy AND native Claude exhausted
DOUBLE_RATE_LIMITED=0
  # … scan fallback output for "hit your limit" / 429 / "Usage limit" (elided in diff view) …
fi
```

In the outcome handler, add a new branch before the generic `else` on line 711:

```bash
elif [ "${DOUBLE_RATE_LIMITED}" = "1" ]; then
if [ -n "${ISSUE_NUMBER}" ]; then
    # … record `rate_limited` outcome instead of `provider_exit` (remainder elided in diff view) …
```

Modify `latest_failure_detail()` to accept an optional `since_line` parameter that filters to only lines written during the current run.

**Implementation:**

1. Change `latest_failure_detail()` (lines 79-92) to accept a second parameter `since_line`
2. Use `tail -n +${since_line}` instead of `tail -50` when `since_line` is provided
3. At the call site (line 712), pass the `LOG_LINE_BEFORE` captured at the start of the current attempt
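The `LOG_LINE_BEFORE` mechanism in step 3 can be sketched standalone (variable names are taken from the plan; the log contents here are made up):

```bash
set -euo pipefail
log=$(mktemp)
echo "FAIL: error from a previous run" >> "${log}"
# Capture the first line number that will belong to the current run
LOG_LINE_BEFORE=$(( $(wc -l < "${log}") + 1 ))
echo "FAIL: error from the current run" >> "${log}"
tail -n "+${LOG_LINE_BEFORE}" "${log}"   # prints only the current run's line
rm -f "${log}"
```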

**Specific changes:**

Replace `latest_failure_detail()`:

```bash
latest_failure_detail() {
local log_file="${1:?log_file required}"
  local since_line="${2:-}"
  # When since_line is provided, scan only the current run's lines:
  #   tail -n "+${since_line}" "${log_file}" …
  # otherwise keep the previous `tail -50` behavior (remainder elided in diff view)
}
```

Update call site at line 712:

```bash
PROVIDER_ERROR_DETAIL=$(latest_failure_detail "${LOG_FILE}" "${LOG_LINE_BEFORE}")
```
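For illustration, a minimal stand-in showing the optional-parameter shape described above (`detail_scan` is a hypothetical name; the real grep/formatting logic of `latest_failure_detail()` is omitted):

```bash
# Hypothetical sketch of the two call modes of the scoped-detail helper
detail_scan() {
  local log_file="${1:?log_file required}"
  local since_line="${2:-}"
  if [ -n "${since_line}" ]; then
    tail -n "+${since_line}" "${log_file}"   # scoped to the current run
  else
    tail -n 50 "${log_file}"                 # legacy behavior
  fi
}

log=$(mktemp)
printf 'old line\nnew line\n' > "${log}"
detail_scan "${log}" 2   # prints only "new line"
rm -f "${log}"
```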
This is already handled — the `code-cleanup-q1-2026.md` issue was a one-time stale-file case (the PRD had not been moved to `done/`).

## Files to Modify

| File | Change |
| ----------------------------- | -------------------------------------------------------------------- |
| `scripts/night-watch-cron.sh` | Add double-rate-limit detection, scope failure detail to current run |
---

docs/PRDs/job-registry.md → docs/PRDs/done/job-registry.md (37 changes: 27 additions, 10 deletions)
**Problem:** Adding a new job type (e.g., analytics) requires touching 15+ files across 4 packages — types, constants, config normalization, env parsing, CLI command, server routes, API client, Scheduling UI, Settings UI, schedule templates, and more. Each job's state shape is inconsistent (executor/reviewer use top-level flat fields, qa/audit/analytics use nested config objects, slicer lives inside `roadmapScanner`).

**Files Analyzed:**

- `packages/core/src/types.ts` — `JobType`, `IJobProviders`, `INightWatchConfig`, `IQaConfig`, `IAuditConfig`, `IAnalyticsConfig`
- `packages/core/src/shared/types.ts` — duplicated type definitions for web contract
- `packages/core/src/constants.ts` — `DEFAULT_*` per job, `VALID_JOB_TYPES`, `DEFAULT_QUEUE_PRIORITY`, `LOG_FILE_NAMES`
- `web/store/useStore.ts` — minimal Zustand, no job-specific state

**Current Behavior:**

- 6 job types exist: `executor`, `reviewer`, `qa`, `audit`, `slicer`, `analytics`
- Executor/reviewer use flat top-level config fields (`cronSchedule`, `reviewerSchedule`, `executorEnabled`, `reviewerEnabled`)
- QA/audit/analytics use nested config objects (`config.qa`, `config.audit`, `config.analytics`) with common shape: `{ enabled, schedule, maxRuntime, ...extras }`
## 2. Solution

**Approach:**

1. Create a **Job Registry** in `packages/core/src/jobs/` that defines each job's metadata, defaults, config access patterns, and env parsing rules in a single object
2. Extract a **`IBaseJobConfig`** interface (`{ enabled, schedule, maxRuntime }`) that all job configs extend
3. Replace per-job boilerplate in `config-normalize.ts` and `config-env.ts` with generic registry-driven loops
4. Create a **web-side Job Registry** (`web/utils/jobs.ts`) with UI metadata (icons, labels, trigger functions, config accessors) that Scheduling/Settings pages iterate over
5. Add a **Zustand `jobs` slice** that provides computed job state derived from `status.config` so components don't need to know each job's config shape

**Architecture Diagram:**

```mermaid
flowchart TB
subgraph Core["@night-watch/core"]
    %% … (diagram body elided in diff view)
  end
```

**Key Decisions:**

- **Migrate executor/reviewer to nested config**: All jobs will use `config.jobs.{id}: { enabled, schedule, maxRuntime, ...extras }`. Auto-detect legacy flat format and migrate on load. This is a breaking config change but gives uniform access patterns.
- **Registry is a const array, not DI**: Simple, testable, no runtime overhead
- **Web job registry stores React components directly** for icons (type-safe, tree-shakeable)
- **Generic `triggerJob(jobId)`** replaces per-job `triggerRun()`, `triggerReview()` etc. (keep old functions as thin wrappers for backward compat)

**Data Changes:**

- `INightWatchConfig` gains `jobs: Record<JobType, IBaseJobConfig & extras>` — replaces flat executor/reviewer fields and nested qa/audit/analytics objects
- Legacy flat fields (`cronSchedule`, `reviewerSchedule`, `executorEnabled`, `reviewerEnabled`) and nested objects (`qa`, `audit`, `analytics`, `roadmapScanner.slicerSchedule`) auto-detected and migrated on config load
- Config file rewritten in new format on first save after migration
**User-visible outcome:** Job registry exists and is the single source of truth for job metadata. All constants derived from it. Tests prove registry drives normalization.

**Files (5):**

- `packages/core/src/jobs/job-registry.ts` — **NEW** — `IJobDefinition` interface + `JOB_REGISTRY` array + accessor utilities
- `packages/core/src/jobs/index.ts` — **NEW** — barrel exports
- `packages/core/src/types.ts` — add `IBaseJobConfig` interface

- [ ] Define `IBaseJobConfig` interface: `{ enabled: boolean; schedule: string; maxRuntime: number }`
- [ ] Define `IJobDefinition<TConfig extends IBaseJobConfig = IBaseJobConfig>` interface with:

```typescript
interface IJobDefinition<TConfig extends IBaseJobConfig = IBaseJobConfig> {
id: JobType;
name: string; // "Executor", "QA", "Auditor"
description: string; // "Creates implementation PRs from PRDs"
cliCommand: string; // "run", "review", "qa", "audit", "planner", "analytics"
logName: string; // "executor", "reviewer", "night-watch-qa", etc.
lockSuffix: string; // ".lock", "-r.lock", "-qa.lock", etc.
queuePriority: number; // 50, 40, 30, 20, 10

// Env var prefix for NW_* overrides
envPrefix: string; // "NW_EXECUTOR", "NW_QA", "NW_AUDIT", etc.

// Extra config field normalizers (beyond enabled/schedule/maxRuntime)
extraFields?: IExtraFieldDef[]; // e.g., QA's branchPatterns, artifacts, etc.

// Defaults
defaultConfig: TConfig;
  // … (fields elided in diff view)
migrateLegacy?: (raw: Record<string, unknown>) => Partial<TConfig> | undefined;
}
```

- [ ] Create `JOB_REGISTRY` const array with entries for all 6 job types
- [ ] Create utility functions: `getJobDef(id)`, `getAllJobDefs()`, `getJobDefByCommand(cmd)`
- [ ] Derive `VALID_JOB_TYPES`, `DEFAULT_QUEUE_PRIORITY`, `LOG_FILE_NAMES` from registry (keep exports stable)
**User-visible outcome:** `config-normalize.ts` and `config-env.ts` use generic loops. Legacy flat config auto-migrated. Adding a job config section no longer requires per-job blocks.

**Files (5):**

- `packages/core/src/jobs/job-registry.ts` — add `normalizeJobConfig()` and `buildJobEnvOverrides()` generic helpers
- `packages/core/src/config-normalize.ts` — replace per-job normalization blocks with registry loop + legacy migration
- `packages/core/src/config-env.ts` — replace per-job env blocks with registry loop
- `packages/core/src/config.ts` — `loadConfig()` detects legacy format and migrates in-memory (optionally rewrites file)
- `packages/core/src/__tests__/config-normalize.test.ts` — verify normalization + migration

**Implementation:**

- [ ] Add `normalizeJobConfig(rawConfig, jobDef)` that reads raw object, applies defaults, validates fields
- [ ] Each `IJobDefinition` declares `extraFields` for job-specific fields beyond `{ enabled, schedule, maxRuntime }`
- [ ] Add `migrateLegacyConfig(raw)` that detects old format (e.g., `cronSchedule` exists at top level) and transforms to new `jobs: { executor: { ... }, ... }` shape
**User-visible outcome:** Scheduling and Settings pages read job definitions from a registry instead of hardcoded arrays. Zustand provides computed job state.

**Files (4):**

- `web/utils/jobs.ts` — **NEW** — Web-side job registry with UI metadata
- `web/store/useStore.ts` — add `jobs` computed slice derived from `status.config`
- `web/api.ts` — add generic `triggerJob(jobId)` function
- `web/utils/cron.ts` — derive schedule template keys from registry

**Implementation:**

- [ ] Create `IWebJobDefinition` extending core `IJobDefinition` with UI fields:
```typescript
interface IWebJobDefinition extends IJobDefinition {
icon: string; // lucide icon component name
triggerEndpoint: string; // '/api/actions/qa'
scheduleTemplateKey: string; // key in IScheduleTemplate.schedules
settingsSection?: 'general' | 'advanced'; // where in Settings to show
}
```

**User-visible outcome:** Scheduling page renders job cards from the registry. Adding a new job automatically shows it in Scheduling.

**Files (3):**

- `web/pages/Scheduling.tsx` — replace hardcoded `agents` array with registry-driven rendering
- `web/components/scheduling/ScheduleConfig.tsx` — use registry for form fields
- `web/utils/cron.ts` — update `IScheduleTemplate` to be extensible

**Implementation:**

- [ ] Replace the hardcoded `agents: IAgentInfo[]` array with `WEB_JOB_REGISTRY.map(job => ...)`
- [ ] Replace `handleJobToggle` if/else chain with generic `job.buildEnabledPatch(enabled, config)`
- [ ] Replace `handleTriggerJob` map with generic `triggerJob(job.id)`
**User-visible outcome:** Settings page job config sections rendered from registry. Adding a new job auto-shows its settings.

**Files (3):**

- `web/pages/Settings.tsx` — replace per-job settings JSX blocks with registry loop
- `web/components/dashboard/AgentStatusBar.tsx` — use registry for process status
- `packages/server/src/routes/action.routes.ts` — generate routes from registry

**Implementation:**

- [ ] Settings: iterate `WEB_JOB_REGISTRY` to render job config sections
- [ ] Each `IWebJobDefinition` can declare its settings fields: `settingsFields: ISettingsField[]`
- [ ] `AgentStatusBar`: derive process list from registry instead of hardcoded