Skip to content

fix(agent_end): await backgroundRun with safety timeout instead of fire-and-forget#490

Open
Banger455 wants to merge 3 commits intoCortexReach:masterfrom
Banger455:fix/agent-end-await-backgroundrun
Open

fix(agent_end): await backgroundRun with safety timeout instead of fire-and-forget#490
Banger455 wants to merge 3 commits intoCortexReach:masterfrom
Banger455:fix/agent-end-await-backgroundrun

Conversation

@Banger455
Copy link
Copy Markdown
Contributor

Summary

Split from #406 – this PR focuses only on the agent_end async fix.

Problem

The agent_end auto-capture hook used void backgroundRun; (fire-and-forget). If the host agent exited before the background promise settled, the capture could be silently dropped — losing important memory entries.

Fix

index.ts

  • Widen AgentEndAutoCaptureHook return type to Promise<void> | void (backwards-compatible)
    • Make agentEndAutoCaptureHook an async function
    • Replace void backgroundRun; with await Promise.race([backgroundRun, timeout(15s)]) — ensures the capture completes (or times out gracefully) before the hook returns
    • Errors are caught and swallowed so auto-capture never breaks the host agent
      test/agent-end-async-capture.test.mjs (new)
  • 7 test cases: async type signature, async declaration, Promise.race usage, fire-and-forget removal, error swallowing, timer cleanup, early-return guard

Test plan

- Make AgentEndAutoCaptureHook return type allow Promise<void>
- Convert hook to async function
- Replace fire-and-forget `void backgroundRun` with Promise.race + 15s safety timeout
- Errors are swallowed so auto-capture never breaks the host agent
Copy link
Copy Markdown
Collaborator

@AliceLJY AliceLJY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core fix is correct and addresses the fire-and-forget issue cleanly.

What's good:

  • Promise.race([backgroundRun, timeout(15s)]) is the right pattern — gives the capture a chance to complete without blocking indefinitely
  • Error swallowing (catch {}) ensures auto-capture never breaks the host agent — correct for a best-effort hook
  • clearTimeout in finally prevents timer leaks
  • Widening the return type to Promise<void> | void is backwards-compatible

One issue to be aware of (non-blocking):

  • The try/catch/finally block around Promise.race uses 4-space indentation while the surrounding IIFE body uses 8-space indentation. This is a cosmetic inconsistency — the code runs correctly, but it reads oddly in context. Worth a follow-up cleanup.

On the test approach: The tests read index.ts as a string and assert on source patterns (regex matching on source text). This is inherently brittle — any refactor that preserves behavior but changes text (e.g., renaming safetyTimer to timer) breaks the tests. Acceptable for now as structural smoke tests, but consider migrating to behavioral tests (mocking the hook and verifying it awaits) in the future.

Approving — the fix is sound and the indentation nit is cosmetic. Assigning to @rwmjhb for merge.

@rwmjhb
Copy link
Copy Markdown
Collaborator

rwmjhb commented Apr 4, 2026

Review: fix(agent_end): await backgroundRun with safety timeout instead of fire-and-forget

Real data-loss fix — void backgroundRun can drop memory entries when the host process exits before the background promise settles. Good direction.

Must Fix

1. Test file not wired into npm test and already fails

test/agent-end-async-capture.test.mjs is not in package.json's test script. Worse, when run directly, both assertions fail: the type regex doesn't match the actual __lastRun callable-object type, and the guard check looks for output instead of event.messages.

Fix: wire into npm test and fix the assertions to match the actual implementation.

2. Rebase needed — build is red (stale base)

Worth noting

  • The 15-second Promise.race is a mitigation, not a guarantee — slow embedding/storage paths can still exceed the timeout and drop captures. Consider making the timeout configurable or documenting this as a known limitation.
  • Stale "fire-and-forget" comment at index.ts:2644-2648 contradicts the new await behavior — update the comment.

Copy link
Copy Markdown
Collaborator

@rwmjhb rwmjhb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

感谢这次修复——void backgroundRun 在进程退出前 drop 内存写入确实是真实的数据丢失 bug,改为 await + timeout 的方向是对的。有几个问题:

Must Fix

F4 — 注释与新行为矛盾,会误导未来维护者index.ts:2644

// Fire-and-forget: run capture work in the background so the hook
// returns immediately and does not hold the session lock.
// Blocking here causes downstream channel deliveries (e.g. Telegram) to be
// silently dropped when the session store lock times out.
// See: https://github.com/CortexReach/memory-lancedb-pro/issues/260

但 hook 现在会 await 最多 15s。注释明确警告 "blocking 会导致 channel delivery 丢失",未来维护者读到这里很可能把 await 当 bug revert 掉,重新引入数据丢失。请更新注释说明:OpenClaw core 调用 runAgentEnd() 本身是 fire-and-forget,session lock 在 hook 执行前已释放,所以在 hook 内 await 是安全的。

EF1 — Build failure(stale base)

strip-envelope-metadata.test.mjs:121,132 失败,与其他 PR 相同的 pre-existing 问题,但 stale_base=true。请 rebase 确认是 pre-existing 后在 PR 中说明。

EF2 — PR 新增的测试文件未被执行

test/agent-end-async-capture.test.mjs(83 行,7 个测试)未包含在 targeted test run 中,full suite 也因 build failure 中断,没有任何 pass/fail 信号。

MR1 — 15s timeout 可能短于插件自身允许的提取时间

如果 smart extraction 在正常负载下需要超过 15s,capture 仍然会被 drop。建议将 timeout 设置为可配置,或与插件的 extraction timeout 对齐。


Nice to Have

  • F3 (test/agent-end-async-capture.test.mjs:16): 第一个测试的 regex 期望箭头函数语法 = (...) => Promise<void>,但实际类型是 call-signature-in-object = { (...): Promise<void>; __lastRun?: ... },regex 无法匹配。
  • F6: import { describe, it, before }before 未使用,可以移除。
  • MR2: timeout 触发时完全静默——建议加 console.warn 或结构化日志,便于生产环境排查丢失的 capture。

方向正确,更新注释 + rebase 后可以合并。

…ehaviour

The comment at the top of the agent_end capture block still said
"Fire-and-forget" and warned that blocking would break channel
deliveries.  Since this PR changed the code to await with a 15 s
safety timeout, the comment was misleading and could cause a future
maintainer to revert the await.

Addresses rwmjhb's F4 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Banger455
Copy link
Copy Markdown
Contributor Author

@rwmjhb F4 addressed in 455cd7f.

Updated the comment block at the top of the agent_end capture to accurately describe the current await + 15s safety timeout behaviour, including why the timeout exists (session lock / channel delivery concern from #260) and what happens when it fires (graceful fallback to fire-and-forget semantics).

Copy link
Copy Markdown
Collaborator

@rwmjhb rwmjhb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct fix for a real data-loss bug. One comment needs updating and one architectural concern worth raising.

Must fix

F4 — Stale comment contradicts new behavior (lines 2644–2648)
The existing block comment still says "Fire-and-forget: run capture work in the background so the hook returns immediately". The hook now awaits with a 15s race, so this is no longer true and will mislead the next reader. Update the comment to describe the new semantics and reference both #260 (original fire-and-forget rationale) and the current fix.

MR1 — 15s timeout is shorter than the plugin's own max extraction time
If extraction is slow (large conversation, cold LLM), the race resolves to the timeout branch and the memory is still dropped — the same failure mode as before, just less frequent. Consider aligning the timeout with EXTRACTION_TIMEOUT_MS or making it configurable so the two values can't silently diverge.


Non-blocking

  • F3 — Test regex pattern doesn't match the actual TypeScript type definition syntax; tests pass by accident. Not blocking but the coverage is illusory.
  • F5 — 4-space try/catch inside 8-space IIFE body — indentation inconsistency.
  • F6before import appears unused after the refactor.
  • MR2 — When the timeout branch fires, the failure is completely silent. A logger.warn with the session key would aid debugging.

Copy link
Copy Markdown
Collaborator

@rwmjhb rwmjhb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fire-and-forget data-loss bug is real and worth fixing. One design issue to address before merge.

Must fix before merge:

  • MR1: The 15-second race timeout is shorter than the plugin's own allowed extraction time. The plugin permits extraction operations that can take longer than 15 seconds (e.g., LLM-based smart extraction). With the current timeout, slow captures are still silently dropped — the same data-loss scenario this PR set out to fix. The timeout should be set to at least the maximum extraction budget, or the await should not have a fixed race timeout at all.

  • MR2: Timeout exit is completely silent. When the 15-second race fires, no log or warning is emitted. Dropped captures will be invisible in diagnostics.

Nice to have (non-blocking):

  • F4: The inline comment still says "fire-and-forget" — please update to reflect the new await behavior.
  • F3: Test regex does not match the actual type definition syntax — test may not catch regressions.
  • EF2: The PR's own test file was not executed in the verification pipeline — please confirm tests pass.

Fix MR1 and MR2, then this is ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants