feat: dockerhub-sync worker for repo_docker pull counts (CM-1213) #4163
joanreyero wants to merge 7 commits into
Pull request overview
Adds a new long-running dockerhub-sync worker under services/apps/packages_worker to (1) discover Docker Hub images for GitHub repos and (2) refresh/snapshot Docker Hub pull counts daily, backed by new repos.docker_checked_at and a partitioned repo_docker_pulls_daily table.
Changes:
- Introduces Docker Hub discovery + refresh loop with per-token GitHub parking and per-IP Hub parking.
- Adds Docker Hub fetch + Dockerfile-detection utilities, persistence helpers, and initial unit tests.
- Extends packages-db schema with docker_checked_at, backlog/staleness indexes, and repo_docker_pulls_daily (range-partitioned).
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| services/apps/packages_worker/src/dockerhub/index.ts | Core discovery/refresh loop, rate-limit parking, page processing |
| services/apps/packages_worker/src/dockerhub/fetchDockerhub.ts | Docker Hub API client + error classification |
| services/apps/packages_worker/src/dockerhub/detectDockerfile.ts | GitHub GraphQL probe for Dockerfile presence |
| services/apps/packages_worker/src/dockerhub/upsertRepoDocker.ts | Upserts into repo_docker and daily snapshot table |
| services/apps/packages_worker/src/dockerhub/types.ts | Shared types + FetchError |
| services/apps/packages_worker/src/dockerhub/candidates.ts | Candidate image-name generation + validation |
| services/apps/packages_worker/src/dockerhub/tests/fetchDockerhub.test.ts | Unit tests for Hub fetch behavior |
| services/apps/packages_worker/src/dockerhub/tests/candidates.test.ts | Unit tests for candidate generation |
| services/apps/packages_worker/src/bin/dockerhub-sync.ts | Worker entrypoint and shutdown wiring |
| services/apps/packages_worker/src/config.ts | Adds getDockerhubConfig() env parsing |
| services/apps/packages_worker/package.json | Adds start/dev scripts for dockerhub-sync (protected file) |
| scripts/services/dockerhub-sync.yaml | Compose service for running dockerhub-sync |
| backend/src/osspckgs/migrations/V1779710880__initial_schema.sql | Schema: docker_checked_at, indexes, repo_docker_pulls_daily table |
Standalone loop worker (modeled on github-repos-enricher) that:
- discovers Docker images for GitHub repos via Dockerfile-gated <owner>/<name> probing on hub.docker.com/v2
- refreshes pull/star counts daily into repo_docker
- snapshots lifetime pull_count into repo_docker_pulls_daily for delta-at-query-time daily granularity

Schema (V1779710880 edited in place, pre-prod):
- repos.docker_checked_at + partial index for discovery backlog
- repo_docker_pulls_daily partitioned by date (pg_partman, mirrors downloads_daily)
- repo_docker_stale_idx on last_synced_at

Tested against a 1000-repo random sample from prod public.repositories: 2.6% hit rate on Hub; 87% of repos have no Dockerfile; ghcr.io is the dominant registry for the remainder. CI-workflow parsing and ghcr/quay probes scoped as follow-ups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
6fb6051 to b23c298
- Retry the same row after a GitHub rate-limit park instead of abandoning it (cursor would otherwise advance past unprobed repos until end-of-sweep).
- Serialize Docker Hub calls via a promise chain so the per-token GitHub fan-out cannot fire concurrent requests against the per-IP Hub budget.
- 401/403 from Hub now classified AUTH and propagated, so a misconfigured base URL fails fast instead of silently marking every image gone.
- Stop discarding valid 200 responses when x-ratelimit-remaining=0.
- Wrap repo_docker + repo_docker_pulls_daily writes in a transaction.
- Classify non-JSON GitHub GraphQL bodies as MALFORMED.

Not addressed (replied on PR):
- Inline SQL stays per packages_worker convention (matches enricher/osv).
- repo_docker_pulls_daily partition setup deferred to pg_partman, same as downloads_daily in the same migration.
- Loop-level retry/parking tests deferred; validated against 1065 real repos.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
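The Hub-side rules in the commit message above (401/403 become AUTH, and a valid 200 is kept even when x-ratelimit-remaining is 0) could look roughly like the sketch below. The names `HubOutcome`, `classifyHubStatus`, and `shouldParkAfter` are hypothetical and not taken from the PR, and the 404/429/other branches are assumptions about how the remaining statuses might be bucketed.

```typescript
// Hypothetical sketch: only the AUTH rule and the "keep valid 200s" rule come
// from the commit message; every name and the other branches are assumptions.
type HubOutcome = 'OK' | 'NOT_FOUND' | 'AUTH' | 'RATE_LIMITED' | 'TRANSIENT'

function classifyHubStatus(status: number): HubOutcome {
  // A 200 is a usable answer even if it consumed the last request in the window.
  if (status === 200) return 'OK'
  // Misconfigured base URL or credentials: propagate instead of marking the image gone.
  if (status === 401 || status === 403) return 'AUTH'
  if (status === 404) return 'NOT_FOUND'
  if (status === 429) return 'RATE_LIMITED'
  return 'TRANSIENT'
}

// Parking is decided separately from classification, so an exhausted budget
// delays the *next* call without discarding the current response.
function shouldParkAfter(status: number, remainingHeader: string | null): boolean {
  return status === 429 || remainingHeader === '0'
}
```

Keeping the two decisions apart is one way to avoid the bug the commit describes, where a successful response was thrown away because the rate-limit header had reached zero.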
Per themarolt's review on #4149, packages-db queries belong in services/libs/data-access-layer/src/packages/ alongside osv.ts. The worker now imports fetchStaleRepoDocker, fetchPendingDockerRepos, upsertRepoDockerRow, upsertRepoDockerDailySnapshot, touchRepoDocker, markRepoDockerChecked from @crowd/data-access-layer; dockerhub/upsertRepoDocker.ts is reduced to the tx orchestrator. Query strings unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Hub calls are serialized via hubChain, so a stalled socket would block all subsequent probes indefinitely. AbortSignal.timeout(30s) on both the Hub and GitHub GraphQL requests; aborts surface as TRANSIENT and retry with backoff.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
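A minimal sketch of the timeout behaviour described in the commit above, assuming a `FetchError` with a `kind` field similar to the one in types.ts. `fetchWithTimeout` and `FetchLike` are hypothetical names; the fetch function is injected so the wrapper can be exercised without a network. `AbortSignal.timeout()` is the standard API, and a fetch aborted by it rejects with an error named `TimeoutError`. Treating `AbortError` as transient too is an assumption.

```typescript
type FetchErrorKind = 'TRANSIENT' | 'AUTH' | 'NOT_FOUND'

class FetchError extends Error {
  readonly kind: FetchErrorKind
  constructor(kind: FetchErrorKind, message: string) {
    super(message)
    this.kind = kind
    this.name = 'FetchError'
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

// Minimal shape of the fetch call this sketch needs; the global fetch fits it.
type FetchLike = (url: string, init: { signal: AbortSignal }) => Promise<{ status: number }>

async function fetchWithTimeout(
  url: string,
  timeoutMs: number,
  doFetch: FetchLike,
): Promise<{ status: number }> {
  try {
    return await doFetch(url, { signal: AbortSignal.timeout(timeoutMs) })
  } catch (err) {
    const name = String((err as { name?: unknown } | null)?.name ?? '')
    if (name === 'TimeoutError' || name === 'AbortError') {
      // Surface the stall as retryable instead of blocking the serialized chain.
      throw new FetchError('TRANSIENT', `request to ${url} timed out after ${timeoutMs}ms`)
    }
    throw err
  }
}
```

In the worker this would presumably be called as `fetchWithTimeout(url, 30_000, fetch)`, with the caller's retry loop handling the TRANSIENT error.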
Conflict in packages_worker/package.json scripts: kept both sides; moved dockerhub-sync inspector port 9235 -> 9238 (deps-dev-ingest took 9235 on main). getDockerhubConfig still reads ENRICHER_GITHUB_TOKENS directly so the enricher-v2 GitHub-App switch on main doesn't affect this worker; migrating dockerhub-sync to getGithubAppConfig() is a follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Aligns with enricher-v2 (#4165). getDockerhubConfig drops the ENRICHER_GITHUB_TOKENS PAT pool; the entrypoint now calls getGithubAppConfig + resolveInstallations + fetchRateLimitDiagnostics, and the discovery fan-out runs one worker per installation id with parkedUntil keyed on installationId. getInstallationToken is called per-request so token refresh/caching is shared with the enricher.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
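The "parkedUntil keyed on installationId" idea from the commit above could be sketched as follows. `InstallationParking` and its methods are hypothetical; the PR's actual data structure is not shown in this conversation. Timestamps are passed in explicitly so the logic stays deterministic.

```typescript
// Hypothetical sketch of per-installation rate-limit parking.
class InstallationParking {
  private readonly parkedUntil = new Map<number, number>()

  // Record that an installation is rate-limited until resetAtMs (epoch millis).
  // A later, shorter park never shortens an existing one.
  park(installationId: number, resetAtMs: number): void {
    const current = this.parkedUntil.get(installationId) ?? 0
    this.parkedUntil.set(installationId, Math.max(current, resetAtMs))
  }

  // How long the worker for this installation should wait before its next call.
  msUntilReady(installationId: number, nowMs: number): number {
    return Math.max(0, (this.parkedUntil.get(installationId) ?? 0) - nowMs)
  }
}
```

Because each installation has its own entry, one exhausted installation pauses only its own worker while the others keep draining the discovery backlog.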
githubFetchWithRetries was returning null on AUTH (same bucket as NOT_FOUND), which caused discoverRepo to mark docker_checked_at and move on. With a bad installation token that would silently stamp every repo for DOCKERHUB_DISCOVERY_INTERVAL_DAYS. Now AUTH re-throws through processDiscoveryPage so the worker exits and restarts with a fresh resolveInstallations(), symmetric to the Hub AUTH path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
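The distinction this commit draws (NOT_FOUND may return null, AUTH must propagate) can be illustrated with a small retry wrapper. This is a sketch under assumptions: `withRetries` is a hypothetical stand-in for githubFetchWithRetries, backoff is omitted for brevity, and returning null after exhausted TRANSIENT retries mirrors the "skip" behaviour mentioned in the Bugbot comment rather than confirmed code.

```typescript
type FetchErrorKind = 'NOT_FOUND' | 'AUTH' | 'TRANSIENT'

class FetchError extends Error {
  readonly kind: FetchErrorKind
  constructor(kind: FetchErrorKind, message: string) {
    super(message)
    this.kind = kind
    this.name = 'FetchError'
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

async function withRetries<T>(attempt: () => Promise<T>, maxAttempts: number): Promise<T | null> {
  for (let i = 1; ; i++) {
    try {
      return await attempt()
    } catch (err) {
      if (!(err instanceof FetchError)) throw err
      // The repo genuinely has nothing to find: safe to stamp it as checked.
      if (err.kind === 'NOT_FOUND') return null
      // Systemic failure: stamping repos now would hide them for the whole interval.
      if (err.kind === 'AUTH') throw err
      // TRANSIENT: retry until the budget is spent, then skip this repo.
      if (i >= maxAttempts) return null
    }
  }
}
```

The point of the split is that a null return is read by the caller as "nothing here", so only per-repo outcomes should ever take that path.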
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit cbf3f6c.
throw new FetchError(
  'TRANSIENT',
  `GraphQL error for ${owner}/${name}: ${err.message ?? err.type}`,
)
GraphQL limits misclassified as transient
Medium Severity
detectDockerfile only treats GraphQL errors[0].type === 'RATE_LIMITED' as a rate limit and does not check errors[0].message for “rate limit” or “ip allow list”, unlike fetchLightRepo. Those cases become TRANSIENT, exhaust retries, then discovery marks the repo checked as a skip instead of parking or failing fast on systemic auth/network policy.
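One way to address this finding is to classify on both the error type and the message text, as the comment says fetchLightRepo does. The sketch below is an assumption about what that might look like: `classifyGraphqlError` and the result names are hypothetical, and mapping the "ip allow list" case to AUTH (fail fast) is a guess based on the comment's wording.

```typescript
// Hypothetical sketch; field names follow the errors[0].type / errors[0].message
// shape quoted in the review comment.
interface GraphqlError {
  type?: string
  message?: string
}

type GraphqlErrorKind = 'RATE_LIMITED' | 'AUTH' | 'TRANSIENT'

function classifyGraphqlError(err: GraphqlError): GraphqlErrorKind {
  const message = (err.message ?? '').toLowerCase()
  // Park the installation instead of burning retries.
  if (err.type === 'RATE_LIMITED' || message.includes('rate limit')) return 'RATE_LIMITED'
  // A network policy block affects every repo, so fail fast.
  if (message.includes('ip allow list')) return 'AUTH'
  return 'TRANSIENT'
}
```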
// Docker Hub's anonymous rate limit is per-IP, not per-token. hubParkedUntil
// is module state so the park survives across refresh and discovery pages, and
// hubChain serializes calls so the per-token GitHub fan-out below can't fire
// concurrent Hub requests against that single per-IP budget.
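The serialization this comment describes can be sketched with a module-level promise chain. Only the name `hubChain` comes from the PR; `enqueueHubCall` is hypothetical and the real implementation may differ.

```typescript
// Module-level tail of the queue. It never rejects, so one failed call
// cannot wedge every call queued after it.
let hubChain: Promise<unknown> = Promise.resolve()

function enqueueHubCall<T>(task: () => Promise<T>): Promise<T> {
  // Start this task only after the previously queued one has settled.
  const run = hubChain.then(() => task())
  // Advance the tail, swallowing the rejection here only;
  // the caller still observes it through `run`.
  hubChain = run.catch(() => undefined)
  return run
}
```

Callers on different GitHub workers can all `await enqueueHubCall(...)` concurrently, and the Hub still sees at most one in-flight request from this process.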

  baseUrl: string,
  imageName: string,
): Promise<DockerhubRepoResult> {
  const url = `${baseUrl}/repositories/${imageName}/`
-- Partitioned monthly via pg_partman — same setup as downloads_daily; add a
-- partman.create_parent('public.repo_docker_pulls_daily', 'date', '1 month', 3)
-- call alongside the downloads_daily registration.
-- ============================================================

-- Last time dockerhub-sync probed this repo for a published Docker image (Dockerfile
-- detection + Hub candidate lookup). NULL = never checked. Separate from last_synced_at
-- because discovery cadence (weeks) differs from light-metadata refresh cadence (daily).
docker_checked_at timestamptz,
The initial schema can't be changed anymore because it's already deployed; this needs a new migration now. There is a CLI scaffold, create-packages-migration <name>, already prepared for this.
for (const row of rows) {
  const result = await hubFetchWithRetries(config.hubBaseUrl, row.image_name)
  if (result) {
    await upsertRepoDocker(qx, row.repo_id, result)
The call to upsertRepoDocker can throw on a DB failure. processRefreshPage has no try/catch, so the error propagates through runDockerhubLoop to main().catch and process.exit(1), and the worker crashes.
processDiscoveryPage handles this case, but processRefreshPage does not.
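A possible shape for the fix, sketched with injected dependencies so it stands alone. `refreshRows`, `StaleRow`, and the callback parameters are hypothetical; the real processRefreshPage takes `qx` and `config` instead. Whether a DB failure should be logged and skipped, or should still crash after repeated failures, is a design choice this sketch does not settle.

```typescript
interface StaleRow {
  repo_id: string
  image_name: string
}

// Processes one refresh page; a failed write is reported and counted
// instead of escaping the loop and taking the worker down.
async function refreshRows<R>(
  rows: StaleRow[],
  fetchImage: (imageName: string) => Promise<R | null>,
  persist: (repoId: string, result: R) => Promise<void>,
  onError: (row: StaleRow, err: unknown) => void,
): Promise<number> {
  let failed = 0
  for (const row of rows) {
    const result = await fetchImage(row.image_name)
    if (result === null) continue
    try {
      await persist(row.repo_id, result)
    } catch (err) {
      failed++
      onError(row, err)
    }
  }
  return failed
}
```

Since the row's last_synced_at is not updated when the write fails, the row would stay stale and be picked up again on a later refresh page.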


Summary
Standalone loop worker (sibling of github-repos-enricher) that discovers Docker Hub images for repos in packages-db and tracks their pull counts with daily granularity.

Discovery (Option B-lite): one GitHub GraphQL call per repo checks for Dockerfile at HEAD:Dockerfile, docker/Dockerfile, build/Dockerfile. If present, probes hub.docker.com/v2/repositories/<owner>/<name>/. Hits are upserted to repo_docker; every repo gets repos.docker_checked_at set so the backlog drains.

Refresh: known images with stale last_synced_at are re-fetched daily; lifetime pull_count is written to repo_docker.pulls and snapshotted into repo_docker_pulls_daily (per-day deltas via LAG() at query time, since Hub doesn't expose daily counts).

Loop: each tick processes one refresh page then one discovery page; idles when both empty. GitHub calls fan out across ENRICHER_GITHUB_TOKENS with per-token parking; Hub calls are sequential with a single per-IP park (Hub rate limit is per-IP, ~180/window).

Schema (V1779710880 edited in place, pre-prod)
- repos.docker_checked_at timestamptz + partial index repos_docker_pending_idx (WHERE host='github' AND docker_checked_at IS NULL)
- repo_docker_pulls_daily(image_name, date, pulls_total) partitioned by date (register with pg_partman alongside downloads_daily)
- repo_docker_stale_idx on repo_docker(last_synced_at)

Files
- src/dockerhub/{index,types,candidates,detectDockerfile,fetchDockerhub,upsertRepoDocker}.ts + 15 vitest cases
- src/bin/dockerhub-sync.ts, src/config.ts (getDockerhubConfig)
- scripts/services/dockerhub-sync.yaml, package.json scripts (port 9235)

Validation against prod data
Ran against a random 1000-repo sample from public.repositories (prod):
[Table not preserved; surviving row labels: "<owner>/<name> on Hub", "<owner>/<name> on Hub"]
7.5 min / 1 token / 0 errors / 0 rate-limits. Top finds: ollama/ollama (140M pulls), hashicorp/packer (47M), semgrep/semgrep (32M).

Follow-ups (scoped, not in this PR)
A CI-workflow-parsing census on the same 1000 repos showed:
[Table not preserved; surviving fragment: "scaleway/cli, qmcgaw/gluetun, nervos/ckb, paketobuildpacks/*, …)"]

Ranked by ROI:
- registry column on repo_docker. ~2× total coverage.
- .github/workflows extraction. +17 Hub images/1000.
- library/ official-image allowlist; broader Dockerfile path probing.

Reviewer notes
- backend/.env.dist.{local,composed} need the DOCKERHUB_* block appended (couldn't write .env* from the dev session; see commit message for values).
- pnpm format in packages_worker is currently broken (strips TS generics); format-check fails on pre-existing files too. Not CI-gated for this workspace. Separate fix needed.
- package.json change is +3 script entries only.

🤖 Generated with Claude Code
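The description says per-day pulls are derived from lifetime snapshots with LAG() at query time. The same arithmetic, written in TypeScript purely as an illustration (the PR does this in SQL; `Snapshot`, `DailyDelta`, and `dailyDeltas` are hypothetical names):

```typescript
interface Snapshot {
  date: string // ISO date, e.g. '2025-01-31'
  pullsTotal: number // lifetime pull_count as reported by Hub on that date
}

interface DailyDelta {
  date: string
  pulls: number | null // null for the first snapshot, like LAG() with no previous row
}

function dailyDeltas(snapshots: Snapshot[]): DailyDelta[] {
  // ISO dates sort correctly as plain strings.
  const sorted = [...snapshots].sort((a, b) => a.date.localeCompare(b.date))
  return sorted.map((s, i) => ({
    date: s.date,
    pulls: i === 0 ? null : s.pullsTotal - sorted[i - 1].pullsTotal,
  }))
}
```

If a day's snapshot is missing, the next delta covers the whole gap, which is worth keeping in mind when reading the numbers as strictly daily.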
Note
Medium Risk
Touches packages-db schema (in-place migration) and a long-running worker that hammers GitHub GraphQL and anonymous Docker Hub rate limits; operational misconfig (missing pg_partman parent for repo_docker_pulls_daily) can fail snapshot writes until partitions exist.
Overview
Adds a dockerhub-sync loop worker (like github-repos-enricher) that links GitHub repos in packages-db to Docker Hub images and keeps pull metrics fresh.

Schema extends the pre-prod osspckgs migration: repos.docker_checked_at, partial repos_docker_pending_idx, repo_docker_stale_idx, and monthly-partitioned repo_docker_pulls_daily (lifetime pull_count snapshots; daily deltas via LAG() at read time). DOCKERHUB_* env vars and scripts/services/dockerhub-sync.yaml wire prod/dev containers.

Worker behavior: each cycle refreshes stale repo_docker rows from Hub API v2, then discovers new mappings: GraphQL Dockerfile probe at three paths, validated owner/repo Hub slug (no library/ heuristic), upsert + daily snapshot in a transaction. GitHub discovery fans out across App installation tokens with rate-limit retry; Hub calls are serialized with per-IP parking. DAL lives in repoDocker.ts; vitest covers candidates and Hub fetch error taxonomy.
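The "validated owner/repo Hub slug" step might be sketched as below. Everything here is an assumption rather than the content of candidates.ts: the function name `hubCandidate` is hypothetical, lowercasing is a guess, and the pattern is the path-component grammar used for Docker image references, which may be looser than Docker Hub's own namespace rules.

```typescript
// One path component of an image reference: lowercase alphanumerics separated
// by '.', '_', '__', or runs of '-'. Assumed here; Hub itself may be stricter.
const COMPONENT = /^[a-z0-9]+(?:(?:\.|_|__|-+)[a-z0-9]+)*$/

// Returns the single <owner>/<name> candidate for a GitHub repo, or null when
// the pair cannot be a valid image name and the Hub probe can be skipped.
function hubCandidate(owner: string, name: string): string | null {
  const namespace = owner.toLowerCase()
  const repo = name.toLowerCase()
  if (!COMPONENT.test(namespace) || !COMPONENT.test(repo)) return null
  return `${namespace}/${repo}`
}
```

Rejecting impossible names before the request saves Hub calls against the per-IP budget.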