feat: pom fetcher #4144
Pull request overview
Introduces a new Maven “POM fetcher” worker loop to enrich packages data in the osspckgs database by fetching Maven Central metadata/POMs, and adds corresponding data-access-layer helpers for selecting candidates and upserting packages/maintainers.
Changes:
- Added `@crowd/data-access-layer` `osspckgs` module with queries for Maven enrichment candidates and upserts into `packages`, `maintainers`, and `package_maintainers`.
- Added a `pom-fetcher` worker (config + entrypoint + enrichment loop) that resolves latest Maven versions and extracts POM metadata (licenses, SCM, developers/contributors).
- Wired up scripts/deps for running the new worker (package.json scripts, docker-compose service yaml, lockfile updates).
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| services/libs/data-access-layer/src/osspckgs/types.ts | Adds DB-facing types for osspckgs package/maintainer upserts and universe rows. |
| services/libs/data-access-layer/src/osspckgs/packages.ts | Adds query to list Maven universe packages needing enrichment + upsert into packages. |
| services/libs/data-access-layer/src/osspckgs/maintainers.ts | Adds upserts for maintainers and package_maintainers. |
| services/libs/data-access-layer/src/osspckgs/index.ts | Re-exports osspckgs DAL surface. |
| services/libs/data-access-layer/src/index.ts | Exposes osspckgs DAL from the package root. |
| services/apps/packages_worker/src/pom-fetcher/runPomEnrichmentLoop.ts | Implements batch/concurrent enrichment loop and persistence of extracted metadata. |
| services/apps/packages_worker/src/pom-fetcher/metadata.ts | Resolves latest version via maven-metadata.xml. |
| services/apps/packages_worker/src/pom-fetcher/extract.ts | Fetches POMs and extracts fields with limited parent inheritance traversal. |
| services/apps/packages_worker/src/config.ts | Adds pom-fetcher config loader. |
| services/apps/packages_worker/src/bin/pom-fetcher.ts | Adds runnable entrypoint with shutdown handling. |
| services/apps/packages_worker/package.json | Adds scripts and deps (axios, fast-xml-parser) for pom-fetcher. |
| scripts/services/pom-fetcher.yaml | Adds docker-compose service definition for pom-fetcher. |
| pnpm-lock.yaml | Updates lockfile for new deps (but includes an unexpected workspace importer). |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
Your PR title doesn't contain a Jira issue key. Consider adding one for better traceability.
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
  dependentReposCount: pkg.dependentReposCount,
})
log.debug({ groupId, artifactId, version }, 'Version unchanged — skipping POM extraction')
return { status: 'unchanged', hopLimitReached: false }
Unchanged sync never clears queue
High Severity
When upstream release matches latest_version, the worker calls touchPackageSyncedAt but leaves ingestion_source unchanged. listMavenPackagesToSync keeps selecting those rows whenever ingestion_source is not a Maven worker outcome, regardless of fresh last_synced_at, so the same critical packages are metadata-polled every batch indefinitely.
Reviewed by Cursor Bugbot for commit 40f8596.
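One possible shape for a selection predicate that avoids this, sketched under assumptions: freshness is checked for every row, so a touched `last_synced_at` removes the row from the queue regardless of `ingestion_source`. The `SyncRow` shape, `isDueForSync`, and the `staleAfterMs` parameter are hypothetical; the real filter lives in the SQL behind `listMavenPackagesToSync`.

```typescript
// Hypothetical row shape; the real columns live in the osspckgs DAL.
interface SyncRow {
  lastSyncedAt: Date | null
}

// A row is due when it was never synced, or when its last sync is older
// than the staleness window. ingestion_source plays no role here, so the
// "unchanged" path only needs to touch last_synced_at to dequeue a row.
function isDueForSync(row: SyncRow, now: Date, staleAfterMs: number): boolean {
  if (row.lastSyncedAt === null) return true
  const ageMs = now.getTime() - row.lastSyncedAt.getTime()
  return ageMs >= staleAfterMs
}
```

The alternative fix is to write a Maven worker outcome into `ingestion_source` on the unchanged path; either approach stops the per-batch re-selection.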
/**
 * Fetches maven-metadata.xml for a Maven artifact and returns the full version
 * list plus the current release version.
 *
 * URL format:
 * https://repo1.maven.org/maven2/{groupPath}/{artifactId}/maven-metadata.xml
 *
 * Returns null when the artifact is not found (404) or the metadata is
 * malformed.
 */
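To illustrate the URL format above and the "stable over prerelease" selection mentioned in the PR overview, here is a small sketch. It assumes `groupPath` is the `groupId` with dots replaced by slashes (the standard Maven repository layout) and that the version list is ordered oldest first. `mavenMetadataUrl`, `isPrerelease`, `pickLatestVersion`, and the prerelease pattern are hypothetical names and heuristics, not the PR's implementation.

```typescript
const REPO1_BASE = 'https://repo1.maven.org/maven2'

// Builds the maven-metadata.xml URL for an artifact.
function mavenMetadataUrl(groupId: string, artifactId: string, baseUrl: string = REPO1_BASE): string {
  const groupPath = groupId.split('.').join('/')
  return `${baseUrl}/${groupPath}/${artifactId}/maven-metadata.xml`
}

// Heuristic (assumption): a version is a prerelease when it ends with a
// common qualifier such as -RC1, -SNAPSHOT, -M2, -beta.1.
const PRERELEASE_SUFFIX = /[.\-_](alpha|beta|rc|cr|m|milestone|snapshot|preview)[.\-_]?\d*$/i

function isPrerelease(version: string): boolean {
  return PRERELEASE_SUFFIX.test(version)
}

// Prefers the advertised release when it is stable, otherwise the newest
// stable entry in the list, otherwise whatever is available.
function pickLatestVersion(versions: string[], release: string | null): string | null {
  if (release !== null && !isPrerelease(release)) return release
  for (let i = versions.length - 1; i >= 0; i--) {
    if (!isPrerelease(versions[i])) return versions[i]
  }
  return release ?? versions[versions.length - 1] ?? null
}
```

For example, `mavenMetadataUrl('org.apache.commons', 'commons-lang3')` yields a URL under `org/apache/commons/commons-lang3/`.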
/**
 * Core POM extraction logic — pure functions (no I/O side-effects, no DB calls).
 * Callers are responsible for concurrency, retries, and persistence.
 */
…by run mode Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
  parent.artifactId,
  parent.version,
  depth + 1,
)
Unbounded parent POM recursion
High Severity
resolveWithInheritance follows parent POMs whenever licenses or SCM are missing, with no maximum depth or visited-coordinate guard. A cyclic or very deep parent chain can recurse until stack overflow or unbounded HTTP, despite the feature’s stated parent-hop limit.
Reviewed by Cursor Bugbot for commit 4ff13f2.
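A minimal sketch of a bounded traversal, assuming an iterative loop with both a hop limit and a visited-coordinate set. The default of 8 hops mirrors the limit stated in the PR overview; `resolveBounded`, `PomLoader`, and the field shapes are hypothetical and stand in for the PR's `resolveWithInheritance`.

```typescript
interface Coordinates {
  groupId: string
  artifactId: string
  version: string
}

interface PomFields {
  licenses: string[]
  scmUrl: string | null
  parent: Coordinates | null
}

// Hypothetical loader; a real one would fetch and parse the POM over HTTP.
type PomLoader = (c: Coordinates) => Promise<PomFields | null>

interface Resolved {
  licenses: string[]
  scmUrl: string | null
  hopLimitReached: boolean
}

async function resolveBounded(start: Coordinates, loadPom: PomLoader, maxHops = 8): Promise<Resolved> {
  const visited = new Set<string>()
  const result: Resolved = { licenses: [], scmUrl: null, hopLimitReached: false }
  let current: Coordinates | null = start
  let hops = 0

  while (current !== null) {
    const key = `${current.groupId}:${current.artifactId}:${current.version}`
    if (visited.has(key)) break // cyclic parent chain: stop
    visited.add(key)

    const pom = await loadPom(current)
    if (pom === null) break

    // The nearest POM in the chain wins for each field.
    if (result.licenses.length === 0) result.licenses = pom.licenses
    if (result.scmUrl === null) result.scmUrl = pom.scmUrl

    const complete = result.licenses.length > 0 && result.scmUrl !== null
    if (complete || pom.parent === null) break

    if (hops >= maxHops) {
      result.hopLimitReached = true
      break
    }
    hops++
    current = pom.parent
  }
  return result
}
```

The loop makes the worst case explicit: at most `maxHops + 1` POM loads per package, and a cycle ends the walk as soon as a coordinate repeats.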
})

export async function mavenCriticalWorkflow(): Promise<void> {
  await processMavenCriticalBatch()
Temporal activity batch timeout risk
Medium Severity
processMavenCriticalBatch can process up to MAVEN_FETCHER_BATCH_SIZE (2000) critical packages with HTTP POM work per activity, while startToCloseTimeout is only 15 minutes. A backlog needing full extraction can exceed that window and fail the activity mid-batch.
Reviewed by Cursor Bugbot for commit 4ff13f2.
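For scale: 2000 packages in a 15-minute (900 s) window leaves an average of 0.45 s per package if they were processed one at a time; concurrency raises that, but a backlog needing full extraction can still exceed it. One mitigation is to stop taking new items once a time budget is spent and leave the rest for the next run. The sketch below assumes a hypothetical `processWithinBudget` helper with an injectable clock; it is not part of the PR.

```typescript
interface BudgetResult {
  processed: number
  remaining: number
}

// Processes items in order until the time budget is used up.
// Items not reached are left for a later run instead of failing the activity.
async function processWithinBudget<T>(
  items: T[],
  handle: (item: T) => Promise<void>,
  budgetMs: number,
  now: () => number = Date.now,
): Promise<BudgetResult> {
  const deadline = now() + budgetMs
  let processed = 0
  for (const item of items) {
    if (now() >= deadline) break
    await handle(item)
    processed++
  }
  return { processed, remaining: items.length - processed }
}
```

Setting `budgetMs` comfortably below `startToCloseTimeout` lets the activity return normally, and the per-minute schedule picks up whatever remains.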
  declared_repository_url = COALESCE(EXCLUDED.declared_repository_url, packages.declared_repository_url),
  repository_url = COALESCE(EXCLUDED.repository_url, packages.repository_url),
  licenses = COALESCE(EXCLUDED.licenses, packages.licenses),
  licenses_raw = COALESCE(EXCLUDED.licenses_raw, packages.licenses_raw),
Null upserts never clear fields
Medium Severity
upsertPackage updates nullable columns with COALESCE(EXCLUDED.*, packages.*), so a full Maven sync that passes null for licenses, SCM URLs, or description cannot clear values already stored from an earlier sync.
Reviewed by Cursor Bugbot for commit 4ff13f2.
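The difference between the two update semantics can be shown in plain TypeScript. This is only a model of the SQL behaviour, with hypothetical names (`PackageFields`, `mergeKeepExisting`, `mergeOverwrite`): the first function mirrors `COALESCE(EXCLUDED.col, packages.col)`, the second mirrors `col = EXCLUDED.col`.

```typescript
interface PackageFields {
  licenses: string[] | null
  repositoryUrl: string | null
}

// Models COALESCE(EXCLUDED.col, packages.col): an incoming null keeps the old value.
function mergeKeepExisting(existing: PackageFields, incoming: PackageFields): PackageFields {
  return {
    licenses: incoming.licenses ?? existing.licenses,
    repositoryUrl: incoming.repositoryUrl ?? existing.repositoryUrl,
  }
}

// Models col = EXCLUDED.col: an incoming null clears the old value.
function mergeOverwrite(_existing: PackageFields, incoming: PackageFields): PackageFields {
  return { ...incoming }
}

const stored: PackageFields = { licenses: ['MIT'], repositoryUrl: 'https://example.org/repo' }
const fullSync: PackageFields = { licenses: null, repositoryUrl: 'https://example.org/repo' }

const kept = mergeKeepExisting(stored, fullSync) // licenses stay ['MIT']
const cleared = mergeOverwrite(stored, fullSync) // licenses become null
```

Whether clearing is desirable depends on whether a null from a full sync means "upstream removed it" or "extraction failed"; the two cases may need different upsert paths.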
// with transient errors — we never do it. Maven coordinates are immutable, so a cached
// POM never goes stale; the LRU size cap is purely to bound memory.

const POM_CACHE_MAX_ENTRIES = 5_000
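A size-capped cache with in-flight coalescing, as described in the PR overview, can be sketched on top of a `Map`, which iterates in insertion order. `PromiseLruCache` is a hypothetical class, and dropping failed loads from the cache is an assumption about the intended behaviour, not the PR's code.

```typescript
// Caches promises keyed by coordinate, so concurrent requests for the same
// key share one load. Map insertion order doubles as the recency order.
class PromiseLruCache<V> {
  private readonly entries = new Map<string, Promise<V>>()

  constructor(private readonly maxEntries: number) {}

  get size(): number {
    return this.entries.size
  }

  get(key: string, load: () => Promise<V>): Promise<V> {
    const hit = this.entries.get(key)
    if (hit !== undefined) {
      // Re-insert to mark the key as most recently used.
      this.entries.delete(key)
      this.entries.set(key, hit)
      return hit
    }

    const pending: Promise<V> = load().catch((err: unknown) => {
      // Assumption: failures are not cached, so a later call can retry.
      if (this.entries.get(key) === pending) this.entries.delete(key)
      throw err
    })
    this.entries.set(key, pending)

    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used key (first in iteration order).
      const oldest = this.entries.keys().next().value
      if (oldest !== undefined) this.entries.delete(oldest)
    }
    return pending
  }
}
```

Because the cached value is the promise itself, a second caller that arrives while the first fetch is still running receives the same pending promise rather than triggering another HTTP request.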
const missingScm = !scmUrl
const parent = extractParent(pom)

if (parent && (missingLicense || missingScm)) {
"monitor:osspckgs:local": "bash -c 'set -a && . ../../../backend/.env.dist.local && . ../../../backend/.env.override.local && set +a && SERVICE=monitor tsx src/scripts/monitorOsspckgs.ts'",
"trigger-bootstrap": "SERVICE=deps-dev-ingest tsx src/scripts/triggerBootstrap.ts",
"trigger-bootstrap:local": "set -a && . ../../../backend/.env.dist.local && . ../../../backend/.env.override.local && set +a && SERVICE=deps-dev-ingest tsx src/scripts/triggerBootstrap.ts",
"monitor:osspckgs:local": "bash -c 'set -a && . ../../../backend/.env.dist.local && . ../../../backend/.env.override.local && set +a && node ../../../scripts/monitor-osspckgs.mjs'",
// waiting for the staleness window. Errors/skips re-run only once stale, so a
// broken package isn't retried every pass.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 8 total unresolved issues (including 7 from previous reviews).
Reviewed by Cursor Bugbot for commit 1ffbc9e.
  const durationSec = Math.round((Date.now() - phaseStartedAt) / 1000)
  log.info({ phase: label, ...total, durationSec }, 'Phase complete')
  return total
}
Backfill loops on repeated failures
High Severity
runMavenCriticalBackfill keeps calling processBatch while any batch reports work, but metadata rate limits (and other paths that return error without updating the row) leave packages permanently “due” in listMavenPackagesToSync. The backfill then re-selects the same batch in a tight loop with no delay, hammering Maven Central and never reaching an empty batch to exit.
Reviewed by Cursor Bugbot for commit 1ffbc9e.
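A sketch of a drain loop that cannot spin, assuming the batch function reports how many rows it selected and how many it actually advanced. `drainWithBackoff`, `BatchResult`, and the option names are hypothetical; the PR's `runMavenCriticalBackfill` and `processBatch` may expose different counters.

```typescript
interface BatchResult {
  selected: number // rows returned by the candidate query
  succeeded: number // rows whose state was actually advanced
}

interface DrainOptions {
  maxIdlePasses: number // consecutive no-progress passes before giving up
  baseDelayMs: number // first backoff delay; doubles on each idle pass
  sleep?: (ms: number) => Promise<void>
}

async function drainWithBackoff(
  processBatch: () => Promise<BatchResult>,
  opts: DrainOptions,
): Promise<{ passes: number; stoppedEarly: boolean }> {
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)))
  let idlePasses = 0
  let passes = 0

  for (;;) {
    const { selected, succeeded } = await processBatch()
    passes++
    if (selected === 0) return { passes, stoppedEarly: false } // queue drained
    if (succeeded > 0) {
      idlePasses = 0
      continue
    }
    // Nothing advanced: the same rows would be re-selected, so back off.
    idlePasses++
    if (idlePasses >= opts.maxIdlePasses) return { passes, stoppedEarly: true }
    await sleep(opts.baseDelayMs * 2 ** (idlePasses - 1))
  }
}
```

This bounds the damage of a rate-limited run; it does not replace recording the failure on the row so that the candidate query stops returning it.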


Summary
Adds a Maven POM fetcher to the packages_worker service that syncs Maven Central package metadata into the packages DB. It pulls candidates from packages_universe, extracts metadata from POM files (with parent-chain resolution), and populates package, version, maintainer, and repository data. This brings the Maven ecosystem to parity with the existing npm pipeline so critical Maven packages get high-quality, enriched metadata for downstream analytics.
Note
Medium Risk
Large new ingestion path writes broadly to packages-db under concurrent HTTP load and a per-minute Temporal schedule; mitigations (idempotent upserts, skip-unchanged, rate-limit handling) are present but operational load and lock contention on shared maintainers/repos remain possible.
Overview
Adds Maven Central enrichment to `packages_worker`: critical Tier-2 packages are synced from Maven metadata and POMs into `packages-db` (packages, versions, maintainers, repos, audit), aligned with the existing npm/OSV worker pattern.

Runtime: A Temporal `maven-critical` schedule (every minute, overlap skip) runs `processMavenCriticalBatch`, which uses repo1 and skips full POM work when `latest_version` is unchanged. `pnpm backfill:maven` runs a one-shot drain with full extraction via the GCS mirror base URL. A dedicated `maven-worker` entrypoint registers only the Maven schedule for local isolation.

Fetching: `maven-metadata.xml` drives release/version selection (stable over prerelease); POM parsing follows parent chains (up to 8 hops) with an in-process LRU + in-flight coalescing cache, rate-limit retries, namespace-ordered batches, and transactional upserts with deadlock retry.

Data layer: New `osspckgs` DAL helpers (`listMavenPackagesToSync`, `upsertPackage`, batch versions, maintainers, repos) and `getMavenConfig` / env vars (`MAVEN_FETCHER_*`). Dependencies `axios` and `fast-xml-parser`; unit tests for normalization helpers. Non-critical Maven batch + schedule remain implemented but not registered on the main worker.

Reviewed by Cursor Bugbot for commit 1ffbc9e. Bugbot is set up for automated code reviews on this repo.