Integrate registry cron scripts into the job queue #715
base: trh/compilers-in-metadata
Conversation
app/src/App/Main.purs (outdated)
-- Initialize registry repo before launching parallel processes, to avoid
-- race condition where both Scheduler and Job Executor try to clone the
-- Registry at the same time
void $ runEffects env do
  Log.info "Initializing registry repo..."
  Registry.readAllMetadata
liftEffect do
  case env.vars.resourceEnv.healthchecksUrl of
    Nothing -> Console.log "HEALTHCHECKS_URL not set, healthcheck pinging disabled"
    Just healthchecksUrl -> Aff.launchAff_ $ healthcheck healthchecksUrl
  Aff.launchAff_ $ withRetryLoop "Scheduler" $ Scheduler.runScheduler env
  Aff.launchAff_ $ withRetryLoop "Job executor" $ JobExecutor.runJobExecutor env
  Router.runRouter env
Right, this was a bit annoying - I added another fiber to schedule jobs, and that resulted in two fibers pulling the registry at the same time and racing to create the folder. I am not sure what's the best way to go about it - I thought about merging the two loops (i.e. pull jobs unless 12h have passed since the last time we ran the crons, in which case loop through the registry/GitHub and enqueue stuff, then go back to running jobs), but the package importer might just take a long time since it's hitting GitHub for each package, and we can't afford that kind of slowdown. So for now I am synchronously pulling the registry first thing, but I am not a big fan of this since it doesn't prevent more race conditions in the future.
Good catch on this race condition. In short, both the Scheduler and JobExecutor fibers call into the REGISTRY effect, which triggers git clones or pulls, and when both fibers start together they race to create the same directory, causing failures.
We could merge the two loops. As you said, though, the daily update job could end up taking a long time because it fetches tags from all repos and may get rate-limited. That said, if you just want to get this PR over the line, perhaps it's acceptable for now. I wrote out a fuller solution below, but it's a bit riskier.
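To illustrate, the merged loop would be roughly this shape (just a sketch; runDailyCrons and runNextQueuedJob are placeholders for the cron work and the executor step, not existing functions):

-- Hypothetical merged loop: keep executing queued jobs, but once roughly every
-- 12 hours run the daily cron work before going back to jobs.
mergedLoop :: DateTime -> Run _ Unit
mergedLoop lastCronRun = do
  now <- Run.liftEffect Now.nowDateTime
  if DateTime.diff now lastCronRun >= Hours 12.0 then do
    runDailyCrons -- legacy importer, transferrer, package set updates
    mergedLoop now
  else do
    runNextQueuedJob
    mergedLoop lastCronRun

The downside is exactly the one described above: while runDailyCrons is busy hitting GitHub, no queued jobs make progress.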
I think the proper solution here is per-repo locks: we add a mutex per git repository inside the REGISTRY handler, and all git operations for a given repo go through that lock. It has some advantages, but it also runs up against the limitations of Run as an effects system.
The classic solution would be to use AVar as a way to produce locks in Aff, and use Aff.bracket to acquire the lock, run the action, and release the lock on completion (this is exception-safe, so on exception the lock would also be released).
However, we can't use Aff.bracket because it would require the actions to be run in Aff and our actions intersperse other effects like LOG. You can't enter Aff and then "re-enter" Run like that. There are workarounds but they're very ugly.
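For reference, the classic Aff-only version of that pattern would be something like this (a minimal standalone sketch, not code from this repo):

import Prelude

import Effect.AVar (AVar)
import Effect.Aff (Aff, bracket)
import Effect.Aff.AVar as AVar

-- A lock is just an AVar holding unit: `take` acquires it (blocking if it is
-- already held), `put` releases it.
newLock :: Aff (AVar Unit)
newLock = AVar.new unit

withLock :: forall a. AVar Unit -> Aff a -> Aff a
withLock lock action =
  bracket
    (AVar.take lock)           -- acquire the lock
    (\_ -> AVar.put unit lock) -- release it, even if `action` throws
    (\_ -> action)

This is the shape we want, but as noted it forces the action into Aff, which our Run-based pipeline can't do.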
So that leaves us with more manual lock management, which ordinarily is a bummer but in our case is actually not too bad. The basic idea would be:
- Add per-repo locks to the registry env, keyed by process in registry.purs
The point of keying by process is for cleanup in the case of an exception -- described later.
data Process = Scheduler | JobExecutor

type RepoLock = { lock :: AVar Unit, owner :: Ref (Maybe Process) }

type RepoLocks = Ref (Map RepoKey RepoLock)

type RegistryEnv = { ..., repoLocks :: RepoLocks }

- Add a withRepoLock helper function
withRepoLock :: forall r a. Process -> RepoLocks -> RepoKey -> Run (LOG + AFF + EFFECT + r) a -> Run _ a
withRepoLock process locks key action = do
  repoLock <- getOrCreateLock locks key
  Run.liftAff $ AVar.take repoLock.lock
  Run.liftEffect $ Ref.write (Just process) repoLock.owner
  result <- action
  Run.liftEffect $ Ref.write Nothing repoLock.owner
  Run.liftAff $ AVar.put unit repoLock.lock
  pure result

- Use the withRepoLock function in the registry handler
pull key = withRepoLock process env.repoLocks key do
  -- existing logic

- Clean up locks on startup/job failure
The basic approach to exception safety is that an exception will kill the fiber (fibers can also be killed by the parent, e.g. on a timeout). Then, the executor/scheduler loops get restarted. At that point we can clear any held locks for the process which restarts -- this avoids the case where an exception leaves the lock held, and then it remains stale.
clearOwnLocks :: Process -> RepoLocks -> Run _ Unit
clearOwnLocks process locksRef = do
  locks <- Run.liftEffect $ Ref.read locksRef
  for_ (Map.toUnfoldable locks :: Array _) \(Tuple key repoLock) -> do
    owner <- Run.liftEffect $ Ref.read repoLock.owner
    when (owner == Just process) do
      Log.warn $ "Clearing orphaned lock held by " <> show process -- assumes a Show instance for Process
      -- clear the lock (reset the owner and put the AVar back)

When the loops start or fail, clear the locks:
runJobExecutor env = runEffects env do
  clearOwnLocks JobExecutor env.registryEnv.repoLocks
  Log.info "Starting ..."
  loop

When we kill a fiber, same deal:
jobResult <- liftAff do
  let timeout = Aff.delay (Milliseconds delay) $> Nothing
  ...

case jobResult of
  Nothing -> do
    Log.error "timed out"
    clearOwnLocks JobExecutor ...
    pure false
  Just (Left _) -> do
    Log.error "died"
    clearOwnLocks ...

This should work in each scenario:
- Normal completion: the lock is released, as per withRepoLock
- Job timeout or error: clearOwnLocks runs on the job result
- Executor/scheduler crash and restart: clearOwnLocks runs on startup
One last thought: clearing locks is for an exceptional scenario where something has gone wrong. There are other possible consequences besides stale locks here: for example, what if we were in the middle of a clone? Then we'll end up with a corrupted repo and the registry will be borked.
So we might want a more general cleanup function that does more than clear locks. It could also run gitCLI [ "rev-parse", "--git-dir" ] to check that each repo is valid and delete it if it isn't. (On pull we already do a force-clean, so stale state is OK.)
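Roughly, sketched in the style of the lock helpers above (gitCLI's exact signature and the delete helper are assumptions, not verified against the current code):

-- Check that the checkout at repoPath is still a valid git repository, and
-- delete it if it isn't, so the next registry operation re-clones it.
-- Assumes gitCLI :: Array String -> Maybe FilePath -> Aff (Either String String).
cleanupRepo :: FilePath -> Run _ Unit
cleanupRepo repoPath = do
  result <- Run.liftAff $ gitCLI [ "rev-parse", "--git-dir" ] (Just repoPath)
  case result of
    Left _ -> do
      Log.warn $ "Repo at " <> repoPath <> " is corrupted, deleting it so it gets re-cloned"
      -- removeDirRecursive is a placeholder for whatever recursive-delete helper we use
      Run.liftAff $ removeDirRecursive repoPath
    Right _ ->
      -- The repo is a valid git dir; pull already force-cleans, so stale state is fine.
      pure unit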
Good writeup! I think we should just add this locking functionality; introducing the scheduler pretty much guarantees race conditions with any jobs that are running.
I have applied this suggestion, but I think it needs a bit more work.
Nothing -> do
  -- Use the lowest compiler from previous version for compatibility,
  -- falling back to latest if no previous version exists
  compiler <- case Map.findMax metadata.published of
This is probably fine, but it's technically possible to publish a package version that targets an older compiler (like a bug fix to an old series of the package for an earlier compiler). We do have a reliable way to determine the compiler from the resolved packages, as is done in the legacy importer (compilerFromResolutions), and we could use that here.
Oh, thanks for the pointer, will use that instead!
Whoops, it's actually called findFirstFromResolutions and it's an inlined helper function; hopefully it's not too much trouble to move it into the app somewhere and import it back into the importer. Or it may be too tied in to the importer caches, in which case you may want to just create a new function.
findFirstFromResolutions :: Map PackageName Version -> Run _ (Either (Map Version CompilerFailure) Version)
findFirstFromResolutions resolutions = ...
Applied this idea, but I kept the existing logic for the case where the package doesn't have any dependencies. In that case it still makes sense to try to take the same compiler as the previous version, instead of falling back to the latest one.
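Roughly, the combined logic reads like this (a sketch; the metadata shape and lowestCompilerOf are placeholders for whatever the real code uses):

-- Pick a compiler for the new version: with dependencies, derive one reliably
-- from the resolutions; with no dependencies, reuse the lowest compiler of the
-- previous published version, falling back to the latest known compiler.
chooseCompiler resolutions metadata latestCompiler
  | Map.isEmpty resolutions = pure $ Right $
      case Map.findMax metadata.published of
        Just { value } -> lowestCompilerOf value
        Nothing -> latestCompiler
  | otherwise =
      findFirstFromResolutions resolutions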
Noticed a few things that need to be resolved before we merge. Also, I went ahead and removed the daily jobs workflow from the registry repository since this replaces it:
purescript/registry#525
This PR integrates the three "Registry scripts" that we run daily to keep data tidy (the package transferrer, the package set upgrades, and the legacy importer) into the Job Queue framework, as detailed here.
The patch is simpler than I thought it would be, and I think that's mainly because we only deal with packages that are already in the registry, rather than packages that could possibly be missing. As a result, the package transfer script and the package set updater are deprecated, while the legacy importer stays in place, since it's the only way we have to bootstrap the registry from scratch at the moment.