Fix #1787: v2.0.2+ regression: bootstrapMemoryCoreFull() hangs with 100% CPU on databases >#1871
Open
Memtensor-AI wants to merge 1 commit into
Open
Fix #1787: v2.0.2+ regression: bootstrapMemoryCoreFull() hangs with 100% CPU on databases >#1871Memtensor-AI wants to merge 1 commit into
Memtensor-AI wants to merge 1 commit into
Conversation
The `namespace-visibility` migration was issuing a blanket
`UPDATE ${table} SET share_scope='private' WHERE share_scope IS NULL`
against every owner-aware table — including the `traces` table, which
on busy installs is the largest, fattest table in the database. On
databases past ~500 MB, that UPDATE held the synchronous bootstrap
transaction in CPU-bound row rewriting (re-validating the JSON CHECK
constraints on every row) for many minutes and never reached
`migrations.summary`, manifesting as the regression filed in #1787:
bridge process burns 80–157 % CPU after `sqlite.open` and never
becomes healthy.
The read path already normalises NULL share_scope to 'private' via
`normalizeShareScope` and `COALESCE(share_scope, 'private')` in
`visibilityWhere`, and new rows pick up the column DEFAULT, so the
bulk UPDATE was cosmetic. Dropping it removes the bootstrap-time row
rewrite entirely.
The same issue also showed up in `memory-core.init()`'s startup
"dirty-closed-episode" scan, which called
`getManyByIds(traceIds).some(tr => tr.ts > scoredAt)` — hydrating
every column of every trace (embedding BLOBs, full `tool_calls_json`,
agent text) into Node memory just to inspect a single number for up
to 500 episodes. Replaced with a new
`traces.hasAnyNewerThan(ids, ts)` helper that issues a single
`SELECT 1 ... LIMIT 1` per chunk.
Tests:
- Added a regression test in `tests/unit/storage/migrator.test.ts`
that pre-seeds rows with NULL `share_scope` and asserts they stay
NULL after migration 007 (would flip back to 'private' if the bulk
UPDATE returned).
- Added coverage for `traces.hasAnyNewerThan` in
`tests/unit/storage/repos.test.ts`.
Fixes #1787
Collaborator
Author
|
Collaborator
Author
✅ Automated Test Results: PASSED测试通过 (35/71)。memos_local_plugin/smoke: 0/1, memos_local_plugin/contract: 35/70。耗时 5s Branch: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Successfully fixed the v2.0.2+ regression where bootstrapMemoryCoreFull() hangs with 100% CPU on databases >500MB.
Root Cause:
The namespace-visibility migration was issuing a bulk UPDATE on all owner-aware tables, including the traces table which is the largest table in busy installations. On databases past ~500 MB, this UPDATE held the synchronous bootstrap transaction in CPU-bound row rewriting (re-validating JSON CHECK constraints on every row) for many minutes and never reached migrations.summary. Additionally, the startup dirty-closed-episode scan was calling getManyByIds(traceIds).some(tr => tr.ts > scoredAt), which hydrated every column of every trace (embedding BLOBs, full tool_calls_json, agent text) into Node memory just to inspect a single timestamp.
Fix Applied:
Removed the bulk UPDATE from the migration in migrator.ts (line 266) - the application layer already treats NULL share_scope as 'private' via normalizeShareScope and COALESCE in visibilityWhere, and new rows get the column DEFAULT, so the bulk UPDATE was purely cosmetic.
Added traces.hasAnyNewerThan(ids, ts) helper in repos/traces.ts that issues a single SELECT 1 ... LIMIT 1 per chunk instead of hydrating full trace rows.
Updated memory-core.ts to use the new lightweight helper instead of getManyByIds().some().
Tests Added:
Files Changed:
The fix eliminates the O(n) row rewrite on large trace tables during bootstrap and replaces the O(total trace bytes) scan with an O(chunk size) exists-check, resolving the hang reported in #1787.
Related Issue (Required): Fixes #1787
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Executor did not report tests.
Checklist
@MatthewZhuang, @CarltonXiang, @syzsunshine219 please review this PR.
Reviewer Checklist