Optional cross-worker lock for shared adaptor repo#1416
Conversation
|
ok @stuartc having looked a bit more closely - this implementation MUST move into the engine The engine already has an What is the sentinel stuff all about? |
| const locksDir = path.join(repoDir, '.locks'); | ||
| const sentinelsDir = path.join(repoDir, '.sentinels'); | ||
|
|
||
| const ensureDirs = (async () => { |
There was a problem hiding this comment.
This is weird. Why not remove the wrapper and just await the mkdir calls?
| logger.debug(`acquired install lock for ${specifier}`); | ||
|
|
||
| try { | ||
| const [hasSentinel, hasPkg] = await Promise.all([ |
There was a problem hiding this comment.
As we're going to move this to the engine, the engine already has an isInstalled function, which determines whether an adaptor is installed or not.
Rather than implement that test twice - as we're doing here - I'd like tot ry and re-use that logic if possible.
Add `--repo-lock` / `WORKER_REPO_LOCK` to coordinate adaptor installs across multiple workers sharing one repo directory (e.g. an NFS mount or k8s PVC). Uses proper-lockfile per-adaptor plus a sentinel cache, so the cache-hit path stays lock-free. Off by default; requires --repo-dir. Lock retry ceiling (6 min) is set above STALE_MS (5 min) so a dead holder's stale window expires before waiters give up, making cold- start of N pods against an empty repo recoverable rather than fatal.
Drop redundant pre-check in ensureLockTarget (wx already throws EEXIST), parallelise fileExists pairs in handleIsInstalled and the post-lock double-check, collapse trivial if/return, drop unused default export, and fix the worker test harness' mode list comment plus an any-typed logger.
Migrate the cross-worker adaptor install lock out of ws-worker and into the engine's autoinstall flow. The lock now lives next to the code that actually performs the install. - New `withInstallLock` primitive in engine-multi's util - `autoinstall.lockRepo` option (default on) wraps the install call - Strengthen `isInstalled` to also stat node_modules/<alias>/package.json, catching half-installed adaptors left by a crashed worker - Drop the sentinel cache; npm's per-package extraction is atomic so the node_modules entry is itself proof of completion - ws-worker CLI flag renamed to `--no-repo-lock` (WORKER_NO_REPO_LOCK) - Rework tests to use tmpdirs + esmock'd runtime install, dropping the handleInstall/handleIsInstalled injection hooks entirely
The pending map was left behind by a 2023 restructure that replaced per-adaptor promise dedup with the module-level queue + busy flag. Nothing writes to it; the delete on error is a no-op.
Defence-in-depth: refuse aliases containing `..` or absolute paths before composing the .locks/<alias>.lock path. Upstream whitelist filtering already restricts specifiers, so this guards against a future bug there becoming a path-traversal write.
Try the lock with retries: 0 first. On ELOCKED, log one info line so k8s operators can see why a worker is sat waiting on cold start, then retry with the full budget. Uncontended acquisitions stay silent.
Yargs auto-negates --no-foo into { foo: false }, so declaring the option
as 'no-repo-lock' meant --no-repo-lock produced { repoLock: false } and
left args.noRepoLock undefined. The downstream lockRepo check thus
stayed true and the flag did nothing.
Rename the option to its positive form 'repo-lock' (default true). yargs'
auto-negation now correctly sets repoLock: false when --no-repo-lock is
passed. WORKER_NO_REPO_LOCK env var still disables the lock.
Also add defence-in-depth path-traversal guard on the lock alias.
e968b73 to
5509997
Compare
ensureLockTarget already creates the parent directory recursively, so the explicit mkdir of locksDir was dead work.
The previous substring check would also reject innocuous names like 'foo..bar'. Split on path separators so only '..' as a real path component is rejected.
Hoist processQueue and doAutoinstall to module scope and pass context/repoDir/logger/useLock through the queue entry instead of relying on the first caller's closure. Removes a latent foot-gun if anyone later adds per-run autoinstall/repoDir overrides or proxies AUTOINSTALL_COMPLETE/ERROR through createWorkflowEvents.
Pure whitespace; line-wrap long template strings and object literals so `pnpm test:format` (CI) passes.
|
Thanks @josephjclark - all fair points, addressed. Lock lives in the engine now. New file Sentinels are gone. Those were the No duplicated
Two files as you suggested: A few smaller things came out of the review:
Changeset updated to declare both |
Short Description
Adds a filesystem lock so multiple ws-workers can safely share a single adaptor repo directory (e.g. an NFS mount or k8s PVC), without two workers racing on the same
npm install.Fixes #1414
Implementation Details
When the repo directory is shared between worker pods, two workers can otherwise hit
installfor the same adaptor at the same time and corrupt each other'snode_modules. The lock lives inengine-multi's autoinstall flow — that's where the install actually happens, so that's where the coordination belongs.How it works:
withInstallLockhelper inpackages/engine-multi/src/util/repo-lock.tsacquires a per-adaptor lockfile under<repoDir>/.locks/<alias>.lockviaproper-lockfile. Different adaptors don't block each other.installcall with the lock. Inside the lock it re-checksisInstalledand skips the install if another worker has just finished it.isInstalledis now stricter: it ANDs the existing repopackage.jsondependency check with a stat ofnode_modules/<alias>/package.json. npm extracts each package atomically, so the presence of that file is itself proof of completion. This also fixes a latent bug where a worker killed mid-install could leave the repopackage.jsonsaying "installed" whilenode_modulesis half-written, and the next worker would happily import broken code.isInstalledreturns true without touching the lock directory.installthrows, the lock is released and the next worker will retry.proper-lockfile's mtime heartbeat updates every 5s and another worker considers a lock stale after 5 min. The retry ceiling (6 min) sits above that, so when a pod dies mid-install its lock expires before waiting workers give up.On by default. Disable with
--no-repo-lock/WORKER_NO_REPO_LOCK=true. Single-worker deployments don't need to do anything to opt out — they'll just never hit contention and so never log anything new.New dep on
proper-lockfile@4.1.2(withgraceful-fs,retry,signal-exittransitives) inengine-multi. All checked for CVEs — clean.QA Notes
Interesting cases to exercise (with
--repo-dirset):--repo-dirstarting a run that needs the same adaptor at the same time — only onenpm installshould actually run; the other should see the install completed during its wait and skip.--no-repo-lock, multi-worker behaviour should be unchanged from before this PR.--repo-lockshould be indistinguishable from before — no contention, no extra logs.There's a multi-process test suite using
child_process.forkthat exercises the cross-worker cases deterministically —pnpm test test/util/repo-lock.test.tsfrompackages/engine-multi.K8s deployments using this need NTP/chrony across nodes — stale-lock detection is mtime-based.
AI Usage
You can read more details in our
Responsible AI Policy