Integrate registry cron scripts into the job queue #715
base: trh/compilers-in-metadata
Conversation
app/src/App/Main.purs (outdated)
-- Initialize registry repo before launching parallel processes, to avoid
-- race condition where both Scheduler and Job Executor try to clone the
-- Registry at the same time
void $ runEffects env do
  Log.info "Initializing registry repo..."
  Registry.readAllMetadata
liftEffect do
  case env.vars.resourceEnv.healthchecksUrl of
    Nothing -> Console.log "HEALTHCHECKS_URL not set, healthcheck pinging disabled"
    Just healthchecksUrl -> Aff.launchAff_ $ healthcheck healthchecksUrl
  Aff.launchAff_ $ withRetryLoop "Scheduler" $ Scheduler.runScheduler env
  Aff.launchAff_ $ withRetryLoop "Job executor" $ JobExecutor.runJobExecutor env
  Router.runRouter env
Right, this was a bit annoying - I added another fiber to schedule jobs, and that resulted in two fibers pulling the registry at the same time and racing to create the folder. I am not sure what's the best way to go about it - I thought about merging the two loops (i.e. pull jobs unless 12h have passed since the last time we ran the crons, in which case loop through the registry/GitHub and enqueue stuff, then go back to running jobs), but the package importer might just take a long time since it's hitting GitHub for each package, and we can't afford that kind of slowdown. So for now I am synchronously pulling the registry first thing, but I am not a big fan of this since it doesn't prevent more race conditions in the future.
Good catch on this race condition. In short, both the Scheduler and JobExecutor fibers call into the REGISTRY effect, which triggers git clones or pulls, and when both fibers start together they race to create the same directory, causing failures.
We could merge the two loops. As you said, though, the daily update job could end up taking a long time because it fetches tags from all repos and may get rate-limited. That said, if you just want to get this PR over the line, perhaps it's acceptable for now. I wrote out a fuller solution below, but it's a bit riskier.
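To illustrate, the merged loop would be roughly this shape (just a sketch; runDailyCrons and runNextQueuedJob are placeholders for the cron work and the executor step, not existing functions):

-- Hypothetical merged loop: keep executing queued jobs, but once roughly every
-- 12 hours run the daily cron work before going back to jobs.
mergedLoop :: DateTime -> Run _ Unit
mergedLoop lastCronRun = do
  now <- Run.liftEffect Now.nowDateTime
  if DateTime.diff now lastCronRun >= Hours 12.0 then do
    runDailyCrons -- legacy importer, transferrer, package set updates
    mergedLoop now
  else do
    runNextQueuedJob
    mergedLoop lastCronRun

The downside is exactly the one described above: while runDailyCrons is busy hitting GitHub, no queued jobs make progress.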
I think the proper solution here is per-repo locks: we add a mutex per git repository inside the REGISTRY handler, and all git operations for a given repo go through that lock. It has some advantages, but it also runs up against the limitations of Run as an effects system.
The classic solution would be to use AVar as a way to produce locks in Aff, and use Aff.bracket to acquire the lock, run the action, and release the lock on completion (this is exception-safe, so on exception the lock would also be released).
However, we can't use Aff.bracket because it would require the actions to be run in Aff and our actions intersperse other effects like LOG. You can't enter Aff and then "re-enter" Run like that. There are workarounds but they're very ugly.
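For reference, the classic Aff-only version of that pattern would be something like this (a minimal standalone sketch, not code from this repo):

import Prelude

import Effect.AVar (AVar)
import Effect.Aff (Aff, bracket)
import Effect.Aff.AVar as AVar

-- A lock is just an AVar holding unit: `take` acquires it (blocking if it is
-- already held), `put` releases it.
newLock :: Aff (AVar Unit)
newLock = AVar.new unit

withLock :: forall a. AVar Unit -> Aff a -> Aff a
withLock lock action =
  bracket
    (AVar.take lock)           -- acquire the lock
    (\_ -> AVar.put unit lock) -- release it, even if `action` throws
    (\_ -> action)

This is the shape we want, but as noted it forces the action into Aff, which our Run-based pipeline can't do.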
So that leaves us with more manual lock management, which ordinarily is a bummer but in our case is actually not too bad. The basic idea would be:
- Add per-repo locks to the registry env, keyed by process in registry.purs
The point of keying by process is for cleanup in the case of an exception -- described later.
data Process = Scheduler | JobExecutor

type RepoLock = { lock :: AVar Unit, owner :: Ref (Maybe Process) }

type RepoLocks = Ref (Map RepoKey RepoLock)

type RegistryEnv = { ..., repoLocks :: RepoLocks }

- Add a withRepoLock helper function
withRepoLock :: forall r a. Process -> RepoLocks -> RepoKey -> Run (LOG + AFF + EFFECT + r) a -> Run _ a
withRepoLock process locks key action = do
  repoLock <- getOrCreateLock locks key
  Run.liftAff $ AVar.take repoLock.lock
  Run.liftEffect $ Ref.write (Just process) repoLock.owner
  result <- action
  Run.liftEffect $ Ref.write Nothing repoLock.owner
  Run.liftAff $ AVar.put unit repoLock.lock
  pure result

- Use the withRepoLock function in the registry handler
pull key = withRepoLock process env.repoLocks key do
  -- existing logic

- Clean up locks on startup/job failure
The basic approach to exception safety is that an exception will kill the fiber (fibers can also be killed by the parent, e.g. on a timeout). Then, the executor/scheduler loops get restarted. At that point we can clear any held locks for the process which restarts -- this avoids the case where an exception leaves the lock held, and then it remains stale.
clearOwnLocks :: Process -> RepoLocks -> Run _ Unit
clearOwnLocks process locksRef = do
  locks <- Run.liftEffect $ Ref.read locksRef
  for_ (Map.toUnfoldable locks :: Array _) \(Tuple key repoLock) -> do
    owner <- Run.liftEffect $ Ref.read repoLock.owner
    when (owner == Just process) do
      Log.warn $ "Clearing orphaned lock held by " <> show process -- assumes a Show instance for Process
      -- clear the lock (reset the owner and put the AVar back)

When the loops start or fail, clear the locks:
runJobExecutor env = runEffects env do
  clearOwnLocks JobExecutor env.registryEnv.repoLocks
  Log.info "Starting ..."
  loop

When we kill a fiber, same deal:
jobResult <- liftAff do
  let timeout = Aff.delay (Milliseconds delay) $> Nothing
  ...

case jobResult of
  Nothing -> do
    Log.error "timed out"
    clearOwnLocks JobExecutor ...
    pure false
  Just (Left _) -> do
    Log.error "died"
    clearOwnLocks ...

This should work in each scenario:
- Normal completion: the lock is released, as per withRepoLock
- Job timeout or error: clearOwnLocks runs on the job result
- Executor/scheduler crash and restart: clearOwnLocks runs on startup
One last thought: clearing locks is for an exceptional scenario where something has gone wrong. There are other possible consequences besides stale locks here: for example, what if we were in the middle of a clone? Then we'll end up with a corrupted repo and the registry will be borked.
So we might want a more general cleanup function that does more than clear locks. It could also run gitCLI [ "rev-parse", "--git-dir" ] to check that each repo is valid and delete it if it isn't. (On pull we already do a force-clean, so stale state is OK.)
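Roughly, sketched in the style of the lock helpers above (gitCLI's exact signature and the delete helper are assumptions, not verified against the current code):

-- Check that the checkout at repoPath is still a valid git repository, and
-- delete it if it isn't, so the next registry operation re-clones it.
-- Assumes gitCLI :: Array String -> Maybe FilePath -> Aff (Either String String).
cleanupRepo :: FilePath -> Run _ Unit
cleanupRepo repoPath = do
  result <- Run.liftAff $ gitCLI [ "rev-parse", "--git-dir" ] (Just repoPath)
  case result of
    Left _ -> do
      Log.warn $ "Repo at " <> repoPath <> " is corrupted, deleting it so it gets re-cloned"
      -- removeDirRecursive is a placeholder for whatever recursive-delete helper we use
      Run.liftAff $ removeDirRecursive repoPath
    Right _ ->
      -- The repo is a valid git dir; pull already force-cleans, so stale state is fine.
      pure unit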
Good writeup! I think we should just add this locking functionality; introducing the scheduler pretty much guarantees race conditions with any jobs that are running.
I have applied this suggestion, but I think it needs a bit more work.
Nothing -> do
  -- Use the lowest compiler from previous version for compatibility,
  -- falling back to latest if no previous version exists
  compiler <- case Map.findMax metadata.published of
This is probably fine, but it's technically possible to publish a package version that targets an older compiler (like a bug fix to an old series of the package for an earlier compiler). We do have a reliable way to determine the compiler from the resolved packages, as is done in the legacy importer (compilerFromResolutions), and we could use that here.
Oh, thanks for the pointer, will use that instead!
Whoops, it's actually called findFirstFromResolutions and it's an inlined helper function; hopefully it's not too much trouble to move it into the app somewhere and import it back into the importer. Or it may be too tied in to the importer caches, in which case you may want to just create a new function.
findFirstFromResolutions :: Map PackageName Version -> Run _ (Either (Map Version CompilerFailure) Version)
findFirstFromResolutions resolutions = ...
Applied this idea, but I kept the existing logic for the case where the package doesn't have any dependencies. In that case it still makes sense to try to take the same compiler as the previous version, instead of falling back to the latest one.
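Roughly, the combined logic reads like this (a sketch; the metadata shape and lowestCompilerOf are placeholders for whatever the real code uses):

-- Pick a compiler for the new version: with dependencies, derive one reliably
-- from the resolutions; with no dependencies, reuse the lowest compiler of the
-- previous published version, falling back to the latest known compiler.
chooseCompiler resolutions metadata latestCompiler
  | Map.isEmpty resolutions = pure $ Right $
      case Map.findMax metadata.published of
        Just { value } -> lowestCompilerOf value
        Nothing -> latestCompiler
  | otherwise =
      findFirstFromResolutions resolutions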
Noticed a few things that need to be resolved before we merge. Also, I went ahead and removed the daily jobs workflow from the registry repository since this replaces it:
purescript/registry#525
This PR integrates the three "Registry scripts" that we run daily to keep data tidy (the package transferrer, the package set upgrades, and the legacy importer) into the Job Queue framework, as detailed here.
The patch is simpler than I thought it would be, and I think that's mainly because we only deal with packages that are already in the registry, rather than packages that could possibly be missing. As a result, the package transfer script and the package set updater are deprecated, while the legacy importer stays in place, since it's the only way we have to bootstrap the registry from scratch at the moment.