feat: pom fetcher #4144

Open
ulemons wants to merge 24 commits into main from feat/pom-fetcher

@ulemons ulemons commented May 26, 2026

Summary

Adds a Maven POM fetcher to the packages_worker service that syncs Maven Central package metadata into the packages DB. It pulls candidates from packages_universe, extracts metadata from POM files (with parent-chain resolution), and populates package, version, maintainer, and repository data. This brings the Maven ecosystem to parity with the existing npm pipeline so critical Maven packages get high-quality, enriched metadata for downstream analytics.

Changes

  • Two-tier fetch strategy — non-critical packages are DB-only (copy universe stats, no HTTP, ~1000 pkg/sec); critical packages get full POM extraction with parent-chain resolution (max 8 hops) for description, homepage, SCM/repo, licenses, maintainers, and the full version list.
  • Two entry points — bin/packages-worker.ts registers the maven-critical Temporal schedule for incremental syncing (skips POM extraction when the version is unchanged), and bin/maven-backfill.ts (pnpm backfill:maven) does a one-shot, resumable full-extraction backfill. The DB state is the cursor, so re-runs pick up where they left off.
  • Module-level parent POM cache (extract.ts) — coordinate-keyed LRU with request coalescing, caches only successful fetches (never null, to avoid poisoning), no TTL since Maven coordinates are immutable. This is the main lever against Maven Central rate limiting and works because the rank_in_ecosystem ordering clusters sibling artifacts that share parent POMs. Exposes getPomCacheStats() for hit-rate observability.
  • New osspckgs data-access-layer module — query functions for packages, versions, maintainers, and repos (functional, pg-promise via queryExecutor), shared across the worker.
  • Delta API support (deltaApi.ts) for incremental upstream change detection, plus benchmark and data-quality validation scripts.
  • Adds unit tests for the pure normalization functions.
  • Maven-specific config in config.ts (POM_FETCHER_REFRESH_DAYS, POM_CACHE_MAX_ENTRIES, etc.) and .env.dist.local entries.
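The parent POM cache bullet above combines three ideas: a size-bounded LRU, coalescing of concurrent requests for the same coordinate, and caching only successful results. A minimal sketch of that combination might look like the following; the class name `CoalescingLruCache`, the injected `fetcher` callback, and the `hits`/`misses` counters are illustrative assumptions, not the code in extract.ts.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not the PR's code.
type Fetcher<V> = (key: string) => Promise<V | null>

class CoalescingLruCache<V> {
  // A Map iterates in insertion order, so the first key is the least recently used.
  private readonly entries = new Map<string, V>()
  // Requests currently in progress, keyed by coordinate.
  private readonly inFlight = new Map<string, Promise<V | null>>()
  private readonly maxEntries: number
  private readonly fetcher: Fetcher<V>
  hits = 0
  misses = 0

  constructor(maxEntries: number, fetcher: Fetcher<V>) {
    this.maxEntries = maxEntries
    this.fetcher = fetcher
  }

  get(key: string): Promise<V | null> {
    const cached = this.entries.get(key)
    if (cached !== undefined) {
      this.hits++
      // Re-insert to mark the entry as most recently used.
      this.entries.delete(key)
      this.entries.set(key, cached)
      return Promise.resolve(cached)
    }

    // Coalesce: concurrent callers for the same key share one promise.
    const pending = this.inFlight.get(key)
    if (pending) return pending

    this.misses++
    const request = this.fetcher(key).then(
      (value) => {
        this.inFlight.delete(key)
        // Only successful fetches are cached; a null result is retried next time.
        if (value !== null) this.store(key, value)
        return value
      },
      (err) => {
        this.inFlight.delete(key)
        throw err
      },
    )
    this.inFlight.set(key, request)
    return request
  }

  size(): number {
    return this.entries.size
  }

  private store(key: string, value: V): void {
    this.entries.set(key, value)
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry (first key in insertion order).
      const oldest = this.entries.keys().next().value
      if (oldest !== undefined) this.entries.delete(oldest)
    }
  }
}
```

In this sketch a cache keyed by `groupId:artifactId:version` needs no TTL, for the reason the description gives: a published coordinate does not change, so the size cap only bounds memory.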

Type of change

  • Bug fix
  • New feature
  • Refactor / cleanup
  • Performance improvement
  • Chore / dependency update
  • Documentation

Note

Medium Risk
Large new ingestion path writes broadly to packages-db under concurrent HTTP load and a per-minute Temporal schedule; mitigations (idempotent upserts, skip-unchanged, rate-limit handling) are present but operational load and lock contention on shared maintainers/repos remain possible.

Overview
Adds Maven Central enrichment to packages_worker: critical Tier-2 packages are synced from Maven metadata and POMs into packages-db (packages, versions, maintainers, repos, audit), aligned with the existing npm/OSV worker pattern.

Runtime: A Temporal maven-critical schedule (every minute, overlap skip) runs processMavenCriticalBatch, which uses repo1 and skips full POM work when latest_version is unchanged. pnpm backfill:maven runs a one-shot drain with full extraction via the GCS mirror base URL. A dedicated maven-worker entrypoint registers only the Maven schedule for local isolation.

Fetching: maven-metadata.xml drives release/version selection (stable over prerelease); POM parsing follows parent chains (up to 8 hops) with an in-process LRU + in-flight coalescing cache, rate-limit retries, namespace-ordered batches, and transactional upserts with deadlock retry.

Data layer: New osspckgs DAL helpers (listMavenPackagesToSync, upsertPackage, batch versions, maintainers, repos) and getMavenConfig / env vars (MAVEN_FETCHER_*). Adds the axios and fast-xml-parser dependencies and unit tests for the normalization helpers. The non-critical Maven batch and schedule remain implemented but are not registered on the main worker.

Reviewed by Cursor Bugbot for commit 1ffbc9e. Bugbot is set up for automated code reviews on this repo.

Copilot AI review requested due to automatic review settings May 26, 2026 15:59

CLAassistant commented May 26, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ mbani01
❌ ulemons

Comment thread services/apps/packages_worker/package.json Fixed
Comment thread services/apps/packages_worker/package.json Fixed
Comment thread services/apps/packages_worker/package.json Fixed
Comment thread services/apps/packages_worker/package.json Fixed
@github-actions github-actions Bot left a comment

Conventional Commits FTW!

@ulemons ulemons changed the base branch from main to feat/track-packages May 26, 2026 16:00
Copilot AI left a comment

Pull request overview

Introduces a new Maven “POM fetcher” worker loop to enrich packages data in the osspckgs database by fetching Maven Central metadata/POMs, and adds corresponding data-access-layer helpers for selecting candidates and upserting packages/maintainers.

Changes:

  • Added @crowd/data-access-layer osspckgs module with queries for Maven enrichment candidates and upserts into packages, maintainers, and package_maintainers.
  • Added a pom-fetcher worker (config + entrypoint + enrichment loop) that resolves latest Maven versions and extracts POM metadata (licenses, SCM, developers/contributors).
  • Wired up scripts/deps for running the new worker (package.json scripts, docker-compose service yaml, lockfile updates).

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| services/libs/data-access-layer/src/osspckgs/types.ts | Adds DB-facing types for osspckgs package/maintainer upserts and universe rows. |
| services/libs/data-access-layer/src/osspckgs/packages.ts | Adds query to list Maven universe packages needing enrichment + upsert into packages. |
| services/libs/data-access-layer/src/osspckgs/maintainers.ts | Adds upserts for maintainers and package_maintainers. |
| services/libs/data-access-layer/src/osspckgs/index.ts | Re-exports osspckgs DAL surface. |
| services/libs/data-access-layer/src/index.ts | Exposes osspckgs DAL from the package root. |
| services/apps/packages_worker/src/pom-fetcher/runPomEnrichmentLoop.ts | Implements batch/concurrent enrichment loop and persistence of extracted metadata. |
| services/apps/packages_worker/src/pom-fetcher/metadata.ts | Resolves latest version via maven-metadata.xml. |
| services/apps/packages_worker/src/pom-fetcher/extract.ts | Fetches POMs and extracts fields with limited parent inheritance traversal. |
| services/apps/packages_worker/src/config.ts | Adds pom-fetcher config loader. |
| services/apps/packages_worker/src/bin/pom-fetcher.ts | Adds runnable entrypoint with shutdown handling. |
| services/apps/packages_worker/package.json | Adds scripts and deps (axios, fast-xml-parser) for pom-fetcher. |
| scripts/services/pom-fetcher.yaml | Adds docker-compose service definition for pom-fetcher. |
| pnpm-lock.yaml | Updates lockfile for new deps (but includes an unexpected workspace importer). |
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/libs/data-access-layer/src/osspckgs/types.ts
Comment thread services/libs/data-access-layer/src/osspckgs/types.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/apps/packages_worker/src/pom-fetcher/runPomEnrichmentLoop.ts Outdated
Comment thread services/apps/packages_worker/src/pom-fetcher/runPomEnrichmentLoop.ts Outdated
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/config.ts Outdated
Comment thread pnpm-lock.yaml Outdated
Base automatically changed from feat/track-packages to main May 26, 2026 17:44
@ulemons ulemons changed the title from "Feat/pom fetcher" to "feat: pom fetcher" May 27, 2026
@ulemons ulemons self-assigned this May 27, 2026
Copilot AI review requested due to automatic review settings June 2, 2026 13:49
@ulemons ulemons force-pushed the feat/pom-fetcher branch from b0812f9 to d907907 Compare June 2, 2026 13:49
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 24 changed files in this pull request and generated 16 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/apps/packages_worker/src/maven/README.md Outdated
Comment thread services/apps/packages_worker/src/maven/runMavenEnrichmentLoop.ts
Comment thread services/libs/data-access-layer/src/osspckgs/versions.ts Outdated
Comment thread services/apps/packages_worker/src/maven/runMavenEnrichmentLoop.ts Outdated
Comment thread services/apps/packages_worker/src/maven/extract.ts
Copilot AI review requested due to automatic review settings June 3, 2026 19:36
@ulemons ulemons force-pushed the feat/pom-fetcher branch from 1fef57d to 27c4836 Compare June 3, 2026 19:40
Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 28 changed files in this pull request and generated 9 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/scripts/validateDataQuality.ts Outdated
Comment thread services/apps/packages_worker/src/maven/metadata.ts
Comment thread services/apps/packages_worker/src/maven/metadata.ts Outdated
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/schedule.ts
Comment thread services/apps/packages_worker/src/maven/README.md Outdated
Comment thread services/apps/packages_worker/src/maven/README.md Outdated
Comment thread backend/.env.dist.local Outdated
Comment thread services/apps/packages_worker/src/maven/README.md Outdated
Copilot AI review requested due to automatic review settings June 3, 2026 20:21
Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 27 changed files in this pull request and generated 15 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/libs/data-access-layer/src/osspckgs/versions.ts Outdated
Comment thread services/apps/packages_worker/src/maven/metadata.ts
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/schedule.ts
Comment thread services/apps/packages_worker/src/maven/README.md Outdated
Comment thread services/apps/packages_worker/src/scripts/validateDataQuality.ts Outdated
Comment thread services/apps/packages_worker/package.json
Comment thread services/apps/packages_worker/src/maven/extract.ts Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Copilot AI review requested due to automatic review settings June 3, 2026 20:40
@ulemons ulemons marked this pull request as ready for review June 3, 2026 20:45

github-actions Bot commented Jun 3, 2026

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

  • feat: add user authentication (CM-123)
  • feat: add user authentication (IN-123)

Projects:

  • CM: Community Data Platform
  • IN: Insights

Please add a Jira issue key to your PR title.

Comment thread services/apps/packages_worker/src/scripts/validateDataQuality.ts Outdated
Comment thread services/apps/packages_worker/package.json
Comment thread services/apps/packages_worker/src/maven/runMavenEnrichmentLoop.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread backend/.env.dist.local Outdated
Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 27 changed files in this pull request and generated 12 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/scripts/validateDataQuality.ts Outdated
Comment thread services/apps/packages_worker/src/maven/metadata.ts
Comment thread services/apps/packages_worker/src/maven/runMavenEnrichmentLoop.ts Outdated
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/README.md Outdated
Comment thread services/apps/packages_worker/src/maven/README.md Outdated
Comment thread backend/.env.dist.local Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread services/libs/data-access-layer/src/osspckgs/repos.ts
Copilot AI review requested due to automatic review settings June 4, 2026 07:15
@ulemons ulemons force-pushed the feat/pom-fetcher branch from ec419bc to 52f5515 Compare June 4, 2026 07:15
ulemons added 17 commits June 4, 2026 16:19
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Copilot AI review requested due to automatic review settings June 4, 2026 14:20
@ulemons ulemons force-pushed the feat/pom-fetcher branch from ac166f0 to 40f8596 Compare June 4, 2026 14:20
    dependentReposCount: pkg.dependentReposCount,
  })
  log.debug({ groupId, artifactId, version }, 'Version unchanged — skipping POM extraction')
  return { status: 'unchanged', hopLimitReached: false }

Unchanged sync never clears queue

High Severity

When upstream release matches latest_version, the worker calls touchPackageSyncedAt but leaves ingestion_source unchanged. listMavenPackagesToSync keeps selecting those rows whenever ingestion_source is not a Maven worker outcome, regardless of fresh last_synced_at, so the same critical packages are metadata-polled every batch indefinitely.
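The report turns on how the selection query decides that a row is "due". The sketch below restates that decision as a TypeScript predicate so the two behaviours can be compared side by side. It is only an illustration: the real `listMavenPackagesToSync` is a SQL query, the field names mirror the columns mentioned above, and the `'maven-worker'` marker value and `refreshDays` parameter are assumptions.

```typescript
// Hypothetical sketch of a "due for sync" check; values are assumptions.
interface SyncRow {
  ingestionSource: string | null
  lastSyncedAt: Date | null
}

// Assumed marker values written by the Maven worker (not taken from the PR).
const MAVEN_WORKER_SOURCES = new Set(['maven-worker'])

const DAY_MS = 24 * 60 * 60 * 1000

function isStale(row: SyncRow, now: Date, refreshDays: number): boolean {
  if (row.lastSyncedAt === null) return true
  return now.getTime() - row.lastSyncedAt.getTime() >= refreshDays * DAY_MS
}

// Behaviour described in the report: a row whose ingestion_source is not a
// Maven worker outcome is always due, even if last_synced_at is fresh.
function isDueAsReported(row: SyncRow, now: Date, refreshDays: number): boolean {
  if (!MAVEN_WORKER_SOURCES.has(row.ingestionSource ?? '')) return true
  return isStale(row, now, refreshDays)
}

// One possible alternative: a fresh last_synced_at suppresses re-selection
// regardless of ingestion_source.
function isDueByStaleness(row: SyncRow, now: Date, refreshDays: number): boolean {
  return isStale(row, now, refreshDays)
}
```

With a row that was touched a minute ago but still carries a non-worker `ingestion_source`, the first predicate keeps returning true on every batch, which is the polling loop the report describes; the second returns false until the staleness window elapses.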

Reviewed by Cursor Bugbot for commit 40f8596.

Copilot AI left a comment

Pull request overview

Copilot reviewed 22 out of 26 changed files in this pull request and generated 5 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment on lines +1 to +10
/**
 * Fetches maven-metadata.xml for a Maven artifact and returns the full version
 * list plus the current release version.
 *
 * URL format:
 *   https://repo1.maven.org/maven2/{groupPath}/{artifactId}/maven-metadata.xml
 *
 * Returns null when the artifact is not found (404) or the metadata is
 * malformed.
 */
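The `{groupPath}` placeholder in the URL format above follows the standard Maven repository layout, where the dots in a groupId become path separators. A small sketch of the URL construction, assuming a helper named `mavenMetadataUrl` (the name and the optional `baseUrl` parameter are illustrative, not taken from the PR):

```typescript
// Sketch: build the maven-metadata.xml URL for a groupId/artifactId pair.
const MAVEN_CENTRAL_BASE = 'https://repo1.maven.org/maven2'

function mavenMetadataUrl(
  groupId: string,
  artifactId: string,
  baseUrl: string = MAVEN_CENTRAL_BASE,
): string {
  // Maven repository layout: dots in the groupId become path separators.
  const groupPath = groupId.split('.').map(encodeURIComponent).join('/')
  // Tolerate a trailing slash on the configured base URL.
  const base = baseUrl.replace(/\/+$/, '')
  return `${base}/${groupPath}/${encodeURIComponent(artifactId)}/maven-metadata.xml`
}
```

For example, `mavenMetadataUrl('org.apache.commons', 'commons-lang3')` yields `https://repo1.maven.org/maven2/org/apache/commons/commons-lang3/maven-metadata.xml`. Passing a different `baseUrl` would cover a mirror such as the one the backfill uses.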
Comment on lines +1 to +4
/**
 * Core POM extraction logic — pure functions (no I/O side-effects, no DB calls).
 * Callers are responsible for concurrency, retries, and persistence.
 */
Comment thread services/libs/data-access-layer/src/osspckgs/maintainers.ts
Comment thread services/apps/packages_worker/package.json
Comment thread backend/.env.dist.local
ulemons added 2 commits June 4, 2026 16:32
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
…by run mode

Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Copilot AI review requested due to automatic review settings June 5, 2026 14:21
      parent.artifactId,
      parent.version,
      depth + 1,
    )

Unbounded parent POM recursion

High Severity

resolveWithInheritance follows parent POMs whenever licenses or SCM are missing, with no maximum depth or visited-coordinate guard. A cyclic or very deep parent chain can recurse until stack overflow or unbounded HTTP, despite the feature’s stated parent-hop limit.
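Both guards the report asks for, a hop limit and a visited-coordinate set, fit naturally into an iterative walk up the parent chain. The following is a sketch under stated assumptions rather than the PR's `resolveWithInheritance`: the `Pom` shape, the injected `fetchPom` callback, and the return type are hypothetical, while the limit of 8 hops and the `hopLimitReached` flag come from the PR description and the diff excerpts.

```typescript
// Hypothetical sketch: types and the fetchPom callback are assumptions.
interface Coord {
  groupId: string
  artifactId: string
  version: string
}

interface Pom {
  licenses: string[]
  scmUrl: string | null
  parent: Coord | null
}

type FetchPom = (coord: Coord) => Promise<Pom | null>

interface Resolved {
  licenses: string[]
  scmUrl: string | null
  hopLimitReached: boolean
}

const MAX_PARENT_HOPS = 8

async function resolveWithGuards(
  start: Coord,
  fetchPom: FetchPom,
  maxHops: number = MAX_PARENT_HOPS,
): Promise<Resolved> {
  const visited = new Set<string>()
  let licenses: string[] = []
  let scmUrl: string | null = null
  let hopLimitReached = false
  let hops = 0
  let current: Coord | null = start

  while (current) {
    const key = `${current.groupId}:${current.artifactId}:${current.version}`
    if (visited.has(key)) break // cyclic parent chain: stop instead of looping
    visited.add(key)

    const pom = await fetchPom(current)
    if (!pom) break

    // The nearest POM in the chain that declares a field wins.
    if (licenses.length === 0) licenses = pom.licenses
    if (scmUrl === null) scmUrl = pom.scmUrl

    const complete = licenses.length > 0 && scmUrl !== null
    if (complete || !pom.parent) break

    if (hops >= maxHops) {
      hopLimitReached = true
      break
    }
    hops++
    current = pom.parent
  }

  return { licenses, scmUrl, hopLimitReached }
}
```

An iterative loop also removes the stack-depth concern entirely, and the walk fetches at most `maxHops + 1` POMs per artifact.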

Reviewed by Cursor Bugbot for commit 4ff13f2.

})

export async function mavenCriticalWorkflow(): Promise<void> {
  await processMavenCriticalBatch()

Temporal activity batch timeout risk

Medium Severity

processMavenCriticalBatch can process up to MAVEN_FETCHER_BATCH_SIZE (2000) critical packages with HTTP POM work per activity, while startToCloseTimeout is only 15 minutes. A backlog needing full extraction can exceed that window and fail the activity mid-batch.
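One way to keep a batch inside an activity timeout is to stop picking up new work once a time budget is spent and leave the remainder for the next scheduled run. The sketch below shows that idea in isolation. It is sequential for brevity (the PR's batch is concurrent), it uses no Temporal APIs, and `processWithinBudget`, its parameters, and the injectable `now` clock are all assumptions made for illustration.

```typescript
// Hypothetical sketch: a time-budgeted loop, independent of Temporal.
interface BudgetResult {
  done: number // items handled before the budget ran out
  deferred: number // items left for a later run
}

async function processWithinBudget<T>(
  items: readonly T[],
  handle: (item: T) => Promise<void>,
  budgetMs: number,
  now: () => number = Date.now,
): Promise<BudgetResult> {
  const deadline = now() + budgetMs
  let done = 0
  for (const item of items) {
    // Check the clock before starting each item, never mid-item.
    if (now() >= deadline) break
    await handle(item)
    done++
  }
  return { done, deferred: items.length - done }
}
```

In this sketch the budget would be set comfortably below the 15-minute `startToCloseTimeout`, since the last item started before the deadline still has to finish. Because the schedule runs every minute and the DB state acts as the cursor, deferred packages are simply selected again by a later batch.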

Reviewed by Cursor Bugbot for commit 4ff13f2.

declared_repository_url = COALESCE(EXCLUDED.declared_repository_url, packages.declared_repository_url),
repository_url = COALESCE(EXCLUDED.repository_url, packages.repository_url),
licenses = COALESCE(EXCLUDED.licenses, packages.licenses),
licenses_raw = COALESCE(EXCLUDED.licenses_raw, packages.licenses_raw),

Null upserts never clear fields

Medium Severity

upsertPackage updates nullable columns with COALESCE(EXCLUDED.*, packages.*), so a full Maven sync that passes null for licenses, SCM URLs, or description cannot clear values already stored from an earlier sync.
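The difference between the two update styles is easy to see when the SQL semantics are mirrored in plain TypeScript. The sketch below is only an analogy for the `ON CONFLICT ... DO UPDATE` clause quoted above: the `PackageFields` shape and both function names are hypothetical, and only three of the columns are modelled.

```typescript
// Hypothetical sketch: models the SQL merge semantics, not the DAL itself.
interface PackageFields {
  description: string | null
  licenses: string[] | null
  repositoryUrl: string | null
}

// Mirrors col = COALESCE(EXCLUDED.col, packages.col):
// an incoming null keeps whatever is already stored.
function mergeCoalesce(stored: PackageFields, incoming: PackageFields): PackageFields {
  return {
    description: incoming.description ?? stored.description,
    licenses: incoming.licenses ?? stored.licenses,
    repositoryUrl: incoming.repositoryUrl ?? stored.repositoryUrl,
  }
}

// Mirrors col = EXCLUDED.col: the incoming row is authoritative, nulls included.
function mergeOverwrite(_stored: PackageFields, incoming: PackageFields): PackageFields {
  return { ...incoming }
}
```

If an earlier sync stored a license and a later full extraction finds none, `mergeCoalesce` keeps the old license while `mergeOverwrite` clears it. Which behaviour is wanted likely depends on whether the incoming row comes from a full extraction or from a partial, stats-only update.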

Reviewed by Cursor Bugbot for commit 4ff13f2.

Copilot AI left a comment

Pull request overview

Copilot reviewed 22 out of 26 changed files in this pull request and generated 5 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment on lines +1 to +4
/**
 * Core POM extraction logic — pure functions (no I/O side-effects, no DB calls).
 * Callers are responsible for concurrency, retries, and persistence.
 */
// with transient errors — we never do it. Maven coordinates are immutable, so a cached
// POM never goes stale; the LRU size cap is purely to bound memory.

const POM_CACHE_MAX_ENTRIES = 5_000
const missingScm = !scmUrl
const parent = extractParent(pom)

if (parent && (missingLicense || missingScm)) {
"monitor:osspckgs:local": "bash -c 'set -a && . ../../../backend/.env.dist.local && . ../../../backend/.env.override.local && set +a && SERVICE=monitor tsx src/scripts/monitorOsspckgs.ts'",
"trigger-bootstrap": "SERVICE=deps-dev-ingest tsx src/scripts/triggerBootstrap.ts",
"trigger-bootstrap:local": "set -a && . ../../../backend/.env.dist.local && . ../../../backend/.env.override.local && set +a && SERVICE=deps-dev-ingest tsx src/scripts/triggerBootstrap.ts",
"monitor:osspckgs:local": "bash -c 'set -a && . ../../../backend/.env.dist.local && . ../../../backend/.env.override.local && set +a && node ../../../scripts/monitor-osspckgs.mjs'",
Comment on lines +39 to +40
// waiting for the staleness window. Errors/skips re-run only once stale, so a
// broken package isn't retried every pass.
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 8 total unresolved issues (including 7 from previous reviews).

Reviewed by Cursor Bugbot for commit 1ffbc9e.

  const durationSec = Math.round((Date.now() - phaseStartedAt) / 1000)
  log.info({ phase: label, ...total, durationSec }, 'Phase complete')
  return total
}

Backfill loops on repeated failures

High Severity

runMavenCriticalBackfill keeps calling processBatch while any batch reports work, but metadata rate limits (and other paths that return error without updating the row) leave packages permanently “due” in listMavenPackagesToSync. The backfill then re-selects the same batch in a tight loop with no delay, hammering Maven Central and never reaching an empty batch to exit.
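A drain loop can avoid this failure mode by distinguishing batches that advanced some rows from batches that only failed, backing off on the latter and giving up after a few in a row. The sketch below illustrates that shape. It is not the PR's `runMavenCriticalBackfill`: the `BatchOutcome` fields, the `DrainOptions` names, and the injectable `sleep` are assumptions.

```typescript
// Hypothetical sketch: a drain loop with stall detection and backoff.
interface BatchOutcome {
  advanced: number // rows whose DB state changed, so they will not be re-selected
  failed: number // rows that errored and are still due
}

interface DrainOptions {
  maxStalledBatches: number
  baseDelayMs: number
  sleep: (ms: number) => Promise<void>
}

async function drainWithBackoff(
  processBatch: () => Promise<BatchOutcome>,
  opts: DrainOptions,
): Promise<{ batches: number; gaveUp: boolean }> {
  let batches = 0
  let stalled = 0
  for (;;) {
    const outcome = await processBatch()
    batches++

    // Empty batch: nothing left to select, the backfill is complete.
    if (outcome.advanced === 0 && outcome.failed === 0) {
      return { batches, gaveUp: false }
    }

    // Progress was made: reset the stall counter and continue immediately.
    if (outcome.advanced > 0) {
      stalled = 0
      continue
    }

    // Only failures: the same rows would be re-selected, so back off or stop.
    stalled++
    if (stalled >= opts.maxStalledBatches) {
      return { batches, gaveUp: true }
    }
    await opts.sleep(opts.baseDelayMs * Math.pow(2, stalled - 1))
  }
}
```

With `maxStalledBatches: 3` and `baseDelayMs: 100`, a source that keeps rate-limiting produces delays of 100 ms and 200 ms before the loop exits with `gaveUp: true`, rather than re-selecting the same batch without pause. Since the backfill is resumable, exiting early loses no work.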

Reviewed by Cursor Bugbot for commit 1ffbc9e.

4 participants