feat: dockerhub-sync worker for repo_docker pull counts (CM-1213) #4163

Open
joanreyero wants to merge 7 commits into main from feat/CM-1213-dockerhub-sync

Conversation

@joanreyero joanreyero commented Jun 3, 2026

Summary

Standalone loop worker (sibling of github-repos-enricher) that discovers Docker Hub images for repos in packages-db and tracks their pull counts with daily granularity.

Discovery (Option B-lite): one GitHub GraphQL call per repo checks for Dockerfile at HEAD:Dockerfile, docker/Dockerfile, build/Dockerfile. If present, probes hub.docker.com/v2/repositories/<owner>/<name>/. Hits are upserted to repo_docker; every repo gets repos.docker_checked_at set so the backlog drains.
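
A minimal sketch of how the three-path probe could fit into a single GraphQL call, assuming one aliased `object(expression: ...)` lookup per path. The function names, the `p0`/`p1`/`p2` alias scheme, and the `ProbeResult` type are illustrative and not taken from detectDockerfile.ts.

```typescript
// Paths probed for a Dockerfile, as listed in the summary above.
const DOCKERFILE_PATHS = ['Dockerfile', 'docker/Dockerfile', 'build/Dockerfile'] as const

// Build one GraphQL document with an aliased `object` lookup per path.
// `object(expression: "HEAD:<path>")` resolves to null when the path is absent.
function buildDockerfileProbeQuery(paths: readonly string[] = DOCKERFILE_PATHS): string {
  const fields = paths
    .map((path, i) => {
      const expression = JSON.stringify('HEAD:' + path)
      return `    p${i}: object(expression: ${expression}) { __typename }`
    })
    .join('\n')
  return [
    'query ($owner: String!, $name: String!) {',
    '  repository(owner: $owner, name: $name) {',
    fields,
    '  }',
    '}',
  ].join('\n')
}

// Assumed shape of `data.repository` for the query above (hypothetical type).
type ProbeResult = Record<string, { __typename: string } | null>

// A repo passes the gate when any probed path resolves to a Blob (a file).
function hasDockerfile(repository: ProbeResult | null): boolean {
  if (!repository) return false
  return Object.values(repository).some((obj) => obj !== null && obj.__typename === 'Blob')
}
```

Checking for `Blob` rather than any non-null object avoids treating a directory named `Dockerfile` as a hit; whether the worker makes that distinction is not stated in this PR.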

Refresh: known images with stale last_synced_at are re-fetched daily; lifetime pull_count is written to repo_docker.pulls and snapshotted into repo_docker_pulls_daily (per-day deltas via LAG() at query time — Hub doesn't expose daily counts).
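
To make the delta-at-query-time idea concrete, here is a small TypeScript sketch of what `LAG()` computes over the snapshot rows: each day's pulls are the difference between that day's lifetime total and the previous snapshot's. The `PullSnapshot` and `DailyPulls` types and the `dailyDeltas` name are assumptions for illustration; the PR does this in SQL at read time.

```typescript
interface PullSnapshot {
  date: string // ISO date (YYYY-MM-DD) of the snapshot
  pullsTotal: number // lifetime pull_count reported by Hub on that date
}

interface DailyPulls {
  date: string
  pulls: number | null // null for the first snapshot, like LAG() with no prior row
}

// Derive per-day pulls from cumulative snapshots, ordered by date.
function dailyDeltas(snapshots: readonly PullSnapshot[]): DailyPulls[] {
  const sorted = [...snapshots].sort((a, b) => a.date.localeCompare(b.date))
  const out: DailyPulls[] = []
  let previousTotal: number | null = null
  for (const snap of sorted) {
    out.push({
      date: snap.date,
      pulls: previousTotal === null ? null : snap.pullsTotal - previousTotal,
    })
    previousTotal = snap.pullsTotal
  }
  return out
}
```

If a day's snapshot is missing, the next delta covers the whole gap rather than one day, which is the same behavior a plain `LAG()` over the stored rows would have.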

Loop: each tick processes one refresh page then one discovery page; idles when both empty. GitHub calls fan out across ENRICHER_GITHUB_TOKENS with per-token parking; Hub calls are sequential with a single per-IP park (Hub rate limit is per-IP, ~180/window).
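
The sequential Hub calls with a single per-IP park could look roughly like the sketch below. `hubChain` and `hubParkedUntil` are names that appear in the worker's own comments; the bodies of `parkHub` and `serializeHub` are assumptions about how such a chain is typically written, not the PR's code.

```typescript
// Module-level state: one chain and one park timestamp for the whole process,
// because the anonymous Hub limit is per-IP rather than per-token.
let hubChain: Promise<void> = Promise.resolve()
let hubParkedUntil = 0 // epoch ms; 0 means not parked

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms)
  })

// Record a park, for example after Hub signals a rate limit. Never shortens an existing park.
function parkHub(untilMs: number): void {
  hubParkedUntil = Math.max(hubParkedUntil, untilMs)
}

// Run `task` only after every previously queued Hub call has settled and any
// active park has elapsed.
function serializeHub<T>(task: () => Promise<T>): Promise<T> {
  const result = hubChain.then(async () => {
    const waitMs = hubParkedUntil - Date.now()
    if (waitMs > 0) await sleep(waitMs)
    return task()
  })
  // Keep the chain usable even when a task rejects.
  hubChain = result.then(
    () => undefined,
    () => undefined,
  )
  return result
}
```

Because every caller goes through the same chain, the GitHub fan-out can run many workers in parallel while Hub still sees at most one request in flight.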

Schema (V1779710880 edited in place — pre-prod)

  • repos.docker_checked_at timestamptz + partial index repos_docker_pending_idx (WHERE host='github' AND docker_checked_at IS NULL)
  • repo_docker_pulls_daily(image_name, date, pulls_total) partitioned by date (register with pg_partman alongside downloads_daily)
  • repo_docker_stale_idx on repo_docker(last_synced_at)

Files

  • src/dockerhub/{index,types,candidates,detectDockerfile,fetchDockerhub,upsertRepoDocker}.ts + 15 vitest cases
  • src/bin/dockerhub-sync.ts, src/config.ts (getDockerhubConfig)
  • scripts/services/dockerhub-sync.yaml, package.json scripts (port 9235)

Validation against prod data

Ran against a random 1000-repo sample from public.repositories (prod):

| Outcome | n | % |
| --- | --- | --- |
| Hit: <owner>/<name> on Hub | 26 | 2.6% |
| Dockerfile present, no <owner>/<name> on Hub | 102 | 10.2% |
| No Dockerfile at probed paths | 869 | 86.9% |
| GitHub 404 | 3 | 0.3% |

7.5 min / 1 token / 0 errors / 0 rate-limits. Top finds: ollama/ollama (140M pulls), hashicorp/packer (47M), semgrep/semgrep (32M).

Follow-ups (scoped, not in this PR)

A CI-workflow-parsing census on the same 1000 repos showed:

  • 66 repos (6.6%) have GHA that publishes a container
  • Registry split: ghcr.io 41 · docker.io 35 · quay.io 8 — ghcr is the dominant target for CDP's LF/CNCF-heavy population
  • CI parsing recovers +17 Hub images v1 misses (org/name differs: scaleway/cli, qmcgaw/gluetun, nervos/ckb, paketobuildpacks/*, …)
  • 31 repos publish only to non-Hub registries

Ranked by ROI:

  1. ghcr.io/quay.io probes — ghcr namespaces == GitHub orgs by design, so the existing heuristic works as-is; needs a registry column on repo_docker. ~2× total coverage.
  2. CI-workflow parsing — replace Dockerfile gate with .github/workflows extraction. +17 Hub images/1000.
  3. library/ official-image allowlist; broader Dockerfile path probing.

Reviewer notes

  • backend/.env.dist.{local,composed} need the DOCKERHUB_* block appended (couldn't write .env* from the dev session — see commit message for values).
  • pnpm format in packages_worker is currently broken (strips TS generics); format-check fails on pre-existing files too. Not CI-gated for this workspace. Separate fix needed.
  • package.json change is +3 script entries only.

🤖 Generated with Claude Code


Note

Medium Risk
Touches packages-db schema (in-place migration) and a long-running worker that hammers GitHub GraphQL and anonymous Docker Hub rate limits; operational misconfig (missing pg_partman parent for repo_docker_pulls_daily) can fail snapshot writes until partitions exist.

Overview
Adds a dockerhub-sync loop worker (like github-repos-enricher) that links GitHub repos in packages-db to Docker Hub images and keeps pull metrics fresh.

Schema extends the pre-prod osspckgs migration: repos.docker_checked_at, partial repos_docker_pending_idx, repo_docker_stale_idx, and monthly-partitioned repo_docker_pulls_daily (lifetime pull_count snapshots; daily deltas via LAG() at read time). DOCKERHUB_* env vars and scripts/services/dockerhub-sync.yaml wire prod/dev containers.

Worker behavior: each cycle refreshes stale repo_docker rows from Hub API v2, then discovers new mappings—GraphQL Dockerfile probe at three paths, validated owner/repo Hub slug (no library/ heuristic), upsert + daily snapshot in a transaction. GitHub discovery fans out across App installation tokens with rate-limit retry; Hub calls are serialized with per-IP parking. DAL lives in repoDocker.ts; vitest covers candidates and Hub fetch error taxonomy.

Reviewed by Cursor Bugbot for commit cbf3f6c. Bugbot is set up for automated code reviews on this repo. Configure here.

Copilot AI review requested due to automatic review settings June 3, 2026 13:40
Comment thread services/apps/packages_worker/src/dockerhub/index.ts
Comment thread services/apps/packages_worker/src/dockerhub/index.ts

Copilot AI left a comment


Pull request overview

Adds a new long-running dockerhub-sync worker under services/apps/packages_worker to (1) discover Docker Hub images for GitHub repos and (2) refresh/snapshot Docker Hub pull counts daily, backed by new repos.docker_checked_at and a partitioned repo_docker_pulls_daily table.

Changes:

  • Introduces Docker Hub discovery + refresh loop with per-token GitHub parking and per-IP Hub parking.
  • Adds Docker Hub fetch + Dockerfile-detection utilities, persistence helpers, and initial unit tests.
  • Extends packages-db schema with docker_checked_at, backlog/staleness indexes, and repo_docker_pulls_daily (range-partitioned).

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| services/apps/packages_worker/src/dockerhub/index.ts | Core discovery/refresh loop, rate-limit parking, page processing |
| services/apps/packages_worker/src/dockerhub/fetchDockerhub.ts | Docker Hub API client + error classification |
| services/apps/packages_worker/src/dockerhub/detectDockerfile.ts | GitHub GraphQL probe for Dockerfile presence |
| services/apps/packages_worker/src/dockerhub/upsertRepoDocker.ts | Upserts into repo_docker and daily snapshot table |
| services/apps/packages_worker/src/dockerhub/types.ts | Shared types + FetchError |
| services/apps/packages_worker/src/dockerhub/candidates.ts | Candidate image-name generation + validation |
| services/apps/packages_worker/src/dockerhub/__tests__/fetchDockerhub.test.ts | Unit tests for Hub fetch behavior |
| services/apps/packages_worker/src/dockerhub/__tests__/candidates.test.ts | Unit tests for candidate generation |
| services/apps/packages_worker/src/bin/dockerhub-sync.ts | Worker entrypoint and shutdown wiring |
| services/apps/packages_worker/src/config.ts | Adds getDockerhubConfig() env parsing |
| services/apps/packages_worker/package.json | Adds start/dev scripts for dockerhub-sync (protected file) |
| scripts/services/dockerhub-sync.yaml | Compose service for running dockerhub-sync |
| backend/src/osspckgs/migrations/V1779710880__initial_schema.sql | Schema: docker_checked_at, indexes, repo_docker_pulls_daily table |

Comment thread services/apps/packages_worker/src/dockerhub/upsertRepoDocker.ts Outdated
Comment thread services/apps/packages_worker/src/dockerhub/detectDockerfile.ts
Comment thread services/apps/packages_worker/src/dockerhub/fetchDockerhub.ts Outdated
Comment thread services/apps/packages_worker/src/dockerhub/__tests__/fetchDockerhub.test.ts Outdated
Comment thread services/apps/packages_worker/src/dockerhub/fetchDockerhub.ts
Comment thread services/apps/packages_worker/src/dockerhub/index.ts
Comment thread services/apps/packages_worker/src/dockerhub/index.ts Outdated
Comment thread services/apps/packages_worker/src/dockerhub/index.ts Outdated
Comment thread backend/src/osspckgs/migrations/V1779710880__initial_schema.sql
Comment thread services/apps/packages_worker/src/dockerhub/index.ts
Standalone loop worker (modeled on github-repos-enricher) that:
- discovers Docker images for GitHub repos via Dockerfile-gated <owner>/<name>
  probing on hub.docker.com/v2
- refreshes pull/star counts daily into repo_docker
- snapshots lifetime pull_count into repo_docker_pulls_daily for delta-at-query-time
  daily granularity

Schema (V1779710880 edited in place, pre-prod):
- repos.docker_checked_at + partial index for discovery backlog
- repo_docker_pulls_daily partitioned by date (pg_partman, mirrors downloads_daily)
- repo_docker_stale_idx on last_synced_at

Tested against a 1000-repo random sample from prod public.repositories:
2.6% hit rate on Hub; 87% of repos have no Dockerfile; ghcr.io is the dominant
registry for the remainder. CI-workflow parsing and ghcr/quay probes scoped as
follow-ups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
@joanreyero joanreyero force-pushed the feat/CM-1213-dockerhub-sync branch from 6fb6051 to b23c298 Compare June 3, 2026 15:42
- Retry the same row after a GitHub rate-limit park instead of abandoning it
  (cursor would otherwise advance past unprobed repos until end-of-sweep).
- Serialize Docker Hub calls via a promise chain so the per-token GitHub fan-out
  cannot fire concurrent requests against the per-IP Hub budget.
- 401/403 from Hub now classified AUTH and propagated, so a misconfigured base
  URL fails fast instead of silently marking every image gone.
- Stop discarding valid 200 responses when x-ratelimit-remaining=0.
- Wrap repo_docker + repo_docker_pulls_daily writes in a transaction.
- Classify non-JSON GitHub GraphQL bodies as MALFORMED.
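
A rough sketch of the Hub status handling these fixes describe, assuming a plain status-code switch. AUTH, TRANSIENT and NOT_FOUND are kinds named in this PR; `OK`, `RATE_LIMIT`, the 429 and 5xx mappings, and both function names are assumptions made for illustration.

```typescript
type HubOutcome = 'OK' | 'NOT_FOUND' | 'AUTH' | 'RATE_LIMIT' | 'TRANSIENT'

// Classify a Hub response by status alone, so a 200 is never discarded
// because of a rate-limit header.
function classifyHubStatus(status: number): HubOutcome {
  if (status >= 200 && status < 300) return 'OK'
  if (status === 401 || status === 403) return 'AUTH' // propagate: likely misconfiguration
  if (status === 404) return 'NOT_FOUND' // image is gone or never existed
  if (status === 429) return 'RATE_LIMIT' // assumed mapping: park until the window resets
  return 'TRANSIENT' // assumed mapping: 5xx and anything unexpected is retried
}

// Whether the next call should wait, decided separately from the outcome
// (assumption: an exhausted budget parks the following request, not this one).
function shouldParkAfter(status: number, remainingHeader: string | null): boolean {
  return status === 429 || remainingHeader === '0'
}
```

Keeping the two decisions apart is one way to honor a 200 whose `x-ratelimit-remaining` is 0: the payload is used, and only the following request waits.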

Not addressed (replied on PR):
- Inline SQL stays per packages_worker convention (matches enricher/osv).
- repo_docker_pulls_daily partition setup deferred to pg_partman, same as
  downloads_daily in the same migration.
- Loop-level retry/parking tests deferred; validated against 1065 real repos.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot AI review requested due to automatic review settings June 3, 2026 16:11

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 3 comments.

Comment thread services/apps/packages_worker/src/dockerhub/detectDockerfile.ts
Comment thread services/apps/packages_worker/src/dockerhub/detectDockerfile.ts
Comment thread services/apps/packages_worker/src/dockerhub/fetchDockerhub.ts
Per themarolt's review on #4149, packages-db queries belong in
services/libs/data-access-layer/src/packages/ alongside osv.ts. The worker now
imports fetchStaleRepoDocker, fetchPendingDockerRepos, upsertRepoDockerRow,
upsertRepoDockerDailySnapshot, touchRepoDocker, markRepoDockerChecked from
@crowd/data-access-layer; dockerhub/upsertRepoDocker.ts is reduced to the tx
orchestrator. Query strings unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Comment thread services/apps/packages_worker/src/dockerhub/index.ts
Hub calls are serialized via hubChain, so a stalled socket would block all
subsequent probes indefinitely. AbortSignal.timeout(30s) on both the Hub and
GitHub GraphQL requests; aborts surface as TRANSIENT and retry with backoff.
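
A sketch of the timeout-to-TRANSIENT mapping, under the assumption that the request helper takes an injectable fetch function. `FetchError` here is a local stand-in with the same `(kind, message)` constructor shape seen elsewhere in this PR; `FetchLike` and `fetchWithTimeout` are hypothetical names.

```typescript
// Local stand-in for the worker's FetchError (assumed shape).
class FetchError extends Error {
  readonly kind: 'TRANSIENT' | 'AUTH'
  constructor(kind: 'TRANSIENT' | 'AUTH', message: string) {
    super(message)
    this.kind = kind
    this.name = 'FetchError'
  }
}

// Minimal fetch signature so the helper can be exercised without a network.
type FetchLike = (url: string, init: { signal: AbortSignal }) => Promise<{ status: number }>

// Abort the request after `timeoutMs` and surface the abort as TRANSIENT so
// the caller's retry/backoff path handles it like any other flaky response.
async function fetchWithTimeout(
  fetchImpl: FetchLike,
  url: string,
  timeoutMs = 30_000,
): Promise<{ status: number }> {
  try {
    return await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) })
  } catch (err) {
    const name = (err as { name?: string } | null)?.name
    if (name === 'TimeoutError' || name === 'AbortError') {
      throw new FetchError('TRANSIENT', `request to ${url} timed out after ${timeoutMs}ms`)
    }
    throw err
  }
}
```

With a serialized Hub chain this matters more than usual: without a timeout, one stalled socket holds the head of the queue and every later probe waits behind it.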

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot AI review requested due to automatic review settings June 5, 2026 14:07

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 1 comment.

Comment thread services/apps/packages_worker/src/dockerhub/index.ts Outdated
joanreyero and others added 2 commits June 5, 2026 15:14
Conflict in packages_worker/package.json scripts: kept both sides; moved
dockerhub-sync inspector port 9235 -> 9238 (deps-dev-ingest took 9235 on main).

getDockerhubConfig still reads ENRICHER_GITHUB_TOKENS directly so the
enricher-v2 GitHub-App switch on main doesn't affect this worker; migrating
dockerhub-sync to getGithubAppConfig() is a follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Aligns with enricher-v2 (#4165). getDockerhubConfig drops the
ENRICHER_GITHUB_TOKENS PAT pool; the entrypoint now calls getGithubAppConfig +
resolveInstallations + fetchRateLimitDiagnostics, and the discovery fan-out
runs one worker per installation id with parkedUntil keyed on installationId.
getInstallationToken is called per-request so token refresh/caching is shared
with the enricher.
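
Parking keyed on installation id can be pictured as a small map from id to a resume timestamp. `parkedUntil` is the name used in the commit message above; the helper functions, and the choice of a numeric id with epoch-millisecond values, are assumptions for this sketch.

```typescript
// installationId -> epoch ms until which that installation should not be used.
const parkedUntil = new Map<number, number>()

// Park an installation after a rate-limit response. Never shortens an existing park.
function parkInstallation(installationId: number, untilMs: number): void {
  const current = parkedUntil.get(installationId) ?? 0
  parkedUntil.set(installationId, Math.max(current, untilMs))
}

function isParked(installationId: number, nowMs: number): boolean {
  return (parkedUntil.get(installationId) ?? 0) > nowMs
}

// Milliseconds the worker for this installation should sleep before its next call.
function waitMsFor(installationId: number, nowMs: number): number {
  return Math.max(0, (parkedUntil.get(installationId) ?? 0) - nowMs)
}
```

Since each worker owns one installation, a park on one id leaves the others free to keep draining the discovery page.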

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot AI review requested due to automatic review settings June 5, 2026 14:19
@joanreyero joanreyero requested review from epipav and themarolt June 5, 2026 14:20
githubFetchWithRetries was returning null on AUTH (same bucket as NOT_FOUND),
which caused discoverRepo to mark docker_checked_at and move on. With a bad
installation token that would silently stamp every repo for
DOCKERHUB_DISCOVERY_INTERVAL_DAYS. Now AUTH re-throws through
processDiscoveryPage so the worker exits and restarts with a fresh
resolveInstallations() — symmetric to the Hub AUTH path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

throw new FetchError(
'TRANSIENT',
`GraphQL error for ${owner}/${name}: ${err.message ?? err.type}`,
)

GraphQL limits misclassified as transient

Medium Severity

detectDockerfile only treats GraphQL errors[0].type === 'RATE_LIMITED' as a rate limit and does not check errors[0].message for “rate limit” or “ip allow list”, unlike fetchLightRepo. Those cases become TRANSIENT, exhaust retries, then discovery marks the repo checked as a skip instead of parking or failing fast on systemic auth/network policy.
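
One way to close this gap is a classifier that looks at both `type` and `message`. This is a sketch only: the `RATE_LIMITED` type and the two message fragments come from the comment above, while the kind names, the mapping of the IP allow list case to `AUTH`, and the function name are assumptions rather than what fetchLightRepo does.

```typescript
interface GraphqlError {
  type?: string
  message?: string
}

type GraphqlErrorKind = 'RATE_LIMIT' | 'AUTH' | 'TRANSIENT'

// Classify the first GraphQL error so rate limits park the token and policy
// failures fail fast, instead of both being retried as TRANSIENT.
function classifyGraphqlError(err: GraphqlError): GraphqlErrorKind {
  const message = (err.message ?? '').toLowerCase()
  if (err.type === 'RATE_LIMITED' || message.includes('rate limit')) return 'RATE_LIMIT'
  if (message.includes('ip allow list')) return 'AUTH' // assumed mapping: systemic, not per-repo
  return 'TRANSIENT'
}
```

Only the `TRANSIENT` bucket should be allowed to exhaust retries and mark the repo as checked; the other two describe the caller's credentials or network, not the repo.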



Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.

Comment on lines +28 to +31
// Docker Hub's anonymous rate limit is per-IP, not per-token. hubParkedUntil
// is module state so the park survives across refresh and discovery pages, and
// hubChain serializes calls so the per-token GitHub fan-out below can't fire
// concurrent Hub requests against that single per-IP budget.
baseUrl: string,
imageName: string,
): Promise<DockerhubRepoResult> {
const url = `${baseUrl}/repositories/${imageName}/`
Comment on lines +583 to +586
-- Partitioned monthly via pg_partman — same setup as downloads_daily; add a
-- partman.create_parent('public.repo_docker_pulls_daily', 'date', '1 month', 3)
-- call alongside the downloads_daily registration.
-- ============================================================
-- Last time dockerhub-sync probed this repo for a published Docker image (Dockerfile
-- detection + Hub candidate lookup). NULL = never checked. Separate from last_synced_at
-- because discovery cadence (weeks) differs from light-metadata refresh cadence (daily).
docker_checked_at timestamptz,

The initial schema can't be changed anymore because it's already deployed, so this needs a new migration. A CLI scaffold, create-packages-migration <name>, is already available for that.

for (const row of rows) {
const result = await hubFetchWithRetries(config.hubBaseUrl, row.image_name)
if (result) {
await upsertRepoDocker(qx, row.repo_id, result)

The call to upsertRepoDocker can throw on a DB failure. processRefreshPage has no try/catch, so the error propagates through runDockerhubLoop to main().catch → process.exit(1), and the worker crashes.


processDiscoveryPage handles this case, but processRefreshPage does not.
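
A possible shape for per-row error handling in the refresh path, as a sketch. `refreshOne` stands in for the Hub fetch plus upsert of a single row, `isFatal` for whatever check decides that an error (for example AUTH) should still stop the worker, and the `RefreshRow` field types are guesses; none of these names come from the PR.

```typescript
interface RefreshRow {
  repo_id: string // assumed type
  image_name: string
}

// Process rows one at a time; a failing row is logged and skipped so a single
// DB or Hub error does not take down the whole worker.
async function processRefreshRows(
  rows: readonly RefreshRow[],
  refreshOne: (row: RefreshRow) => Promise<void>,
  isFatal: (err: unknown) => boolean,
  logError: (message: string, err: unknown) => void,
): Promise<{ ok: number; failed: number }> {
  let ok = 0
  let failed = 0
  for (const row of rows) {
    try {
      await refreshOne(row)
      ok++
    } catch (err) {
      if (isFatal(err)) throw err // systemic failures still propagate
      failed++
      logError(`refresh failed for ${row.image_name}`, err)
    }
  }
  return { ok, failed }
}
```

A row that fails this way keeps its stale last_synced_at, so it would come up again on a later tick; whether that retry is wanted, or needs a backoff, depends on the kind of failure.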
