Skip to content

Optional cross-worker lock for shared adaptor repo#1416

Open
stuartc wants to merge 11 commits into
mainfrom
1414-distributed-adaptor-install-lock
Open

Optional cross-worker lock for shared adaptor repo#1416
stuartc wants to merge 11 commits into
mainfrom
1414-distributed-adaptor-install-lock

Conversation

@stuartc
Copy link
Copy Markdown
Member

@stuartc stuartc commented May 19, 2026

Short Description

Adds a filesystem lock so multiple ws-workers can safely share a single adaptor repo directory (e.g. an NFS mount or k8s PVC), without two workers racing on the same npm install.

Fixes #1414

Implementation Details

When the repo directory is shared between worker pods, two workers can otherwise hit install for the same adaptor at the same time and corrupt each other's node_modules. The lock lives in engine-multi's autoinstall flow — that's where the install actually happens, so that's where the coordination belongs.

How it works:

  • A new withInstallLock helper in packages/engine-multi/src/util/repo-lock.ts acquires a per-adaptor lockfile under <repoDir>/.locks/<alias>.lock via proper-lockfile. Different adaptors don't block each other.
  • Autoinstall wraps its existing install call with the lock. Inside the lock it re-checks isInstalled and skips the install if another worker has just finished it.
  • The engine's isInstalled is now stricter: it ANDs the existing repo package.json dependency check with a stat of node_modules/<alias>/package.json. npm extracts each package atomically, so the presence of that file is itself proof of completion. This also fixes a latent bug where a worker killed mid-install could leave the repo package.json saying "installed" while node_modules is half-written, and the next worker would happily import broken code.
  • The cache-hit path stays lock-free — isInstalled returns true without touching the lock directory.
  • If install throws, the lock is released and the next worker will retry.
  • Stale-lock recovery: proper-lockfile's mtime heartbeat updates every 5s and another worker considers a lock stale after 5 min. The retry ceiling (6 min) sits above that, so when a pod dies mid-install its lock expires before waiting workers give up.
  • When a worker has to wait for the lock, one info-level log line fires so the cause of a slow cold-start is visible without enabling debug logging.

On by default. Disable with --no-repo-lock / WORKER_NO_REPO_LOCK=true. Single-worker deployments don't need to do anything to opt out — they'll just never hit contention and so never log anything new.

New dep on proper-lockfile@4.1.2 (with graceful-fs, retry, signal-exit transitives) in engine-multi. All checked for CVEs — clean.

QA Notes

Interesting cases to exercise (with --repo-dir set):

  • Two workers pointing at the same --repo-dir starting a run that needs the same adaptor at the same time — only one npm install should actually run; the other should see the install completed during its wait and skip.
  • A worker killed mid-install (SIGKILL) — the next worker should recover after the stale window (5 min) rather than getting stuck forever.
  • Different adaptors should install in parallel, not serialise.
  • With --no-repo-lock, multi-worker behaviour should be unchanged from before this PR.
  • Single-worker behaviour with the default --repo-lock should be indistinguishable from before — no contention, no extra logs.

There's a multi-process test suite using child_process.fork that exercises the cross-worker cases deterministically — pnpm test test/util/repo-lock.test.ts from packages/engine-multi.

K8s deployments using this need NTP/chrony across nodes — stale-lock detection is mtime-based.

AI Usage

  • I have used Claude Code
  • I have used another model
  • I have not used AI

You can read more details in our
Responsible AI Policy

@github-project-automation github-project-automation Bot moved this to New Issues in Core May 19, 2026
Comment thread packages/ws-worker/src/start.ts Outdated
Comment thread packages/ws-worker/src/util/repo-lock.ts Outdated
@josephjclark
Copy link
Copy Markdown
Collaborator

ok @stuartc having looked a bit more closely - this implementation MUST move into the engine

The engine already has an autoinstall.ts. You should just extend that to use the lockfile stuff before calling the runtime's install. I think you probably want the lock and install logic in two different files, but that's up to you.

What is the sentinel stuff all about?

const locksDir = path.join(repoDir, '.locks');
const sentinelsDir = path.join(repoDir, '.sentinels');

const ensureDirs = (async () => {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird. Why not remove the wrapper and just await the mkdir calls?

logger.debug(`acquired install lock for ${specifier}`);

try {
const [hasSentinel, hasPkg] = await Promise.all([
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we're going to move this to the engine, the engine already has an isInstalled function, which determines whether an adaptor is installed or not.

Rather than implement that test twice - as we're doing here - I'd like tot ry and re-use that logic if possible.

stuartc added 7 commits May 22, 2026 10:48
Add `--repo-lock` / `WORKER_REPO_LOCK` to coordinate adaptor installs
across multiple workers sharing one repo directory (e.g. an NFS mount
or k8s PVC). Uses proper-lockfile per-adaptor plus a sentinel cache,
so the cache-hit path stays lock-free. Off by default; requires
--repo-dir.

Lock retry ceiling (6 min) is set above STALE_MS (5 min) so a dead
holder's stale window expires before waiters give up, making cold-
start of N pods against an empty repo recoverable rather than fatal.
Drop redundant pre-check in ensureLockTarget (wx already throws EEXIST),
parallelise fileExists pairs in handleIsInstalled and the post-lock
double-check, collapse trivial if/return, drop unused default export,
and fix the worker test harness' mode list comment plus an any-typed
logger.
Migrate the cross-worker adaptor install lock out of ws-worker and
into the engine's autoinstall flow. The lock now lives next to the
code that actually performs the install.

- New `withInstallLock` primitive in engine-multi's util
- `autoinstall.lockRepo` option (default on) wraps the install call
- Strengthen `isInstalled` to also stat node_modules/<alias>/package.json,
  catching half-installed adaptors left by a crashed worker
- Drop the sentinel cache; npm's per-package extraction is atomic so
  the node_modules entry is itself proof of completion
- ws-worker CLI flag renamed to `--no-repo-lock` (WORKER_NO_REPO_LOCK)
- Rework tests to use tmpdirs + esmock'd runtime install, dropping
  the handleInstall/handleIsInstalled injection hooks entirely
The pending map was left behind by a 2023 restructure that replaced
per-adaptor promise dedup with the module-level queue + busy flag.
Nothing writes to it; the delete on error is a no-op.
Defence-in-depth: refuse aliases containing `..` or absolute paths
before composing the .locks/<alias>.lock path. Upstream whitelist
filtering already restricts specifiers, so this guards against a
future bug there becoming a path-traversal write.
Try the lock with retries: 0 first. On ELOCKED, log one info line
so k8s operators can see why a worker is sat waiting on cold start,
then retry with the full budget. Uncontended acquisitions stay silent.
Yargs auto-negates --no-foo into { foo: false }, so declaring the option
as 'no-repo-lock' meant --no-repo-lock produced { repoLock: false } and
left args.noRepoLock undefined. The downstream lockRepo check thus
stayed true and the flag did nothing.

Rename the option to its positive form 'repo-lock' (default true). yargs'
auto-negation now correctly sets repoLock: false when --no-repo-lock is
passed. WORKER_NO_REPO_LOCK env var still disables the lock.

Also add defence-in-depth path-traversal guard on the lock alias.
@stuartc stuartc force-pushed the 1414-distributed-adaptor-install-lock branch from e968b73 to 5509997 Compare May 22, 2026 11:37
stuartc added 4 commits May 22, 2026 14:59
ensureLockTarget already creates the parent directory recursively, so the
explicit mkdir of locksDir was dead work.
The previous substring check would also reject innocuous names like
'foo..bar'. Split on path separators so only '..' as a real path
component is rejected.
Hoist processQueue and doAutoinstall to module scope and pass
context/repoDir/logger/useLock through the queue entry instead of
relying on the first caller's closure. Removes a latent foot-gun if
anyone later adds per-run autoinstall/repoDir overrides or proxies
AUTOINSTALL_COMPLETE/ERROR through createWorkflowEvents.
Pure whitespace; line-wrap long template strings and object literals
so `pnpm test:format` (CI) passes.
@stuartc stuartc requested a review from josephjclark May 22, 2026 13:48
@stuartc
Copy link
Copy Markdown
Member Author

stuartc commented May 22, 2026

Thanks @josephjclark - all fair points, addressed.

Lock lives in the engine now. New file packages/engine-multi/src/util/repo-lock.ts exports a single withInstallLock(repoDir, alias, logger, fn) helper - pure lock primitive, nothing else. Autoinstall wraps its existing install call with it. The worker doesn't touch lock files at all any more; it just flips autoinstall.lockRepo via a --no-repo-lock flag. Lock is on by default.

Sentinels are gone. Those were the .sentinels/<alias>.done marker files - written atomically once npm install returned successfully, so the install handler could check them as a "this finished cleanly" signal. They're redundant: npm extracts each package atomically, so the presence of node_modules/<alias>/package.json is enough. I've tightened the engine's existing isInstalled to AND its package.json dependency check with a stat of that file - so it's now a check of disk state rather than just npm's book-keeping. That catches manual node_modules cleanup, NFS visibility lag on shared volumes, and probably some mid-install crash cases too (haven't dug into npm's exact write ordering to be sure). Benefits every deployment, not just locked ones.

No duplicated isInstalled. Engine's isInstalled is the single source of truth now - initially this ws-worker version had its own sentinel + node_modules check inline, which duplicated the engine's pkg-deps check. The in-lock double-check (re-run isInstalled after acquiring, so a worker that queued behind another one's install sees the install has completed during the wait and skips its own attempt) is preserved - just calling the engine's check rather than a parallel implementation.

ensureDirs IIFE is gone - replaced with a small ensureLockTarget(target) helper that mkdirs the parent and touches the lock file. No IIFE-on-first-call dance.

Two files as you suggested: util/repo-lock.ts (lock primitive) and api/autoinstall.ts (orchestration + install). Runtime is untouched - engine still calls the runtime's install exactly as before, just inside a lock.

A few smaller things came out of the review:

  • Security/alias guard in withInstallLock - refuses aliases where .. appears as a path segment, or that are absolute, on top of the upstream whitelist.
  • Contention is logged at info level (one line, only when actually contended) so ops can see why a worker is sat waiting on cold start without needing debug logs on.
  • Dropped a dead pending map in autoinstall.ts left behind by your 2023 single-queue refactor - declared but never written to, just a stray delete pending[a] in the error path.
  • Caught a yargs auto-negation bug while wiring the flag up: --no-repo-lock was being parsed as { repoLock: false } rather than { noRepoLock: true }, so the original wiring was a no-op. Reversed the option and let yargs do the negation properly. Tests now cover all four combinations of CLI flag + env var.
  • Pinned per-call options (context/repoDir/logger/useLock) on the autoinstall queue entry. They were captured in the first caller's closure under the original code, which is fine today (engine is a singleton with stable options) but would silently misbehave the moment anyone wires per-run autoinstall overrides or proxies AUTOINSTALL_COMPLETE/ERROR through createWorkflowEvents. Cheap to fix now, expensive to debug later.

Changeset updated to declare both @openfn/engine-multi and @openfn/ws-worker as minors.

@stuartc stuartc marked this pull request as ready for review May 23, 2026 11:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: New Issues

Development

Successfully merging this pull request may close these issues.

Worker: allow multiple workers to safely share a repo directory

3 participants