
Reduce binary distribution size: unbundle optional Hadoop/Kafka deps + de-duplicate shared jars#8819

Draft
rzo1 wants to merge 6 commits into master from reduce-distro-size

Conversation

@rzo1

@rzo1 rzo1 commented Jun 30, 2026

Contributor

What & why

The binary distribution ships ~395 MB of jars, much of it optional or duplicated. This PR trims it substantially without removing any capability: optional pieces become fetch-on-demand, and jars shared between the daemon and worker classpaths are de-duplicated.

A smaller artifact means less download, storage, and registry bandwidth, a smaller container image, and a lighter CI/CD carbon footprint. 🌱

Changes

  1. storm-autocreds no longer bundled (-79 MB). It pulls the full Hadoop/HBase client tree but is only used on secure (Kerberos) clusters and is off by default. Now ships only the README (like the other external/* connectors); bin/storm-autocreds-fetch retrieves the plugin and its deps into extlib-daemon.

  2. storm-kafka-monitor no longer bundled (-38 MB). Only needed to show Kafka spout lag in the UI or to run bin/storm-kafka-monitor. bin/storm-kafka-monitor-fetch installs it into lib-tools/storm-kafka-monitor. The UI degrades gracefully when it is absent (TopologySpoutLag detects it, shows an actionable message, logs once) and the wrapper prints a hint instead of ClassNotFound.

  3. lib-common de-duplication (-71 MB). The worker classpath (lib-worker) is a byte-identical subset of the daemon classpath (lib). Shared jars are now kept once in lib-common/ (added to both classpaths via bin/storm.py) and removed from lib. dedup-libs.py only merges byte-identical jars (name and sha-256), so no version is silently merged; tool classpaths are left untouched.

Net: roughly -188 MB (~47%) of the bundled jar payload, with no loss of functionality.
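
The "name and sha-256" rule from item 3 can be illustrated with a minimal Python sketch. This is not the dedup-libs.py from this PR: the function names `sha256_of` and `dedup` are hypothetical, and it assumes lib-common has already been populated with the shared jars.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup(lib: Path, lib_common: Path) -> list[str]:
    """Delete jars from lib that also exist, byte for byte, in lib_common.

    A jar is removed only when both the file name and the SHA-256 match,
    so a same-named jar with different content (a version mismatch) stays.
    """
    removed = []
    for shared in sorted(lib_common.glob("*.jar")):
        candidate = lib / shared.name
        if candidate.is_file() and sha256_of(candidate) == sha256_of(shared):
            candidate.unlink()
            removed.append(shared.name)
    return removed
```

Comparing digests rather than only file names is what keeps two different builds of the same-named jar from being collapsed into one.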

Notes for reviewers

Opening as draft to get early feedback on the approach. Still need to verify on a Linux system (full -Pdist native distribution build).

@rzo1 rzo1 requested review from GGraziadei, jnioche and reiabreu June 30, 2026 18:16
rzo1 added 6 commits July 1, 2026 13:31
storm-autocreds pulls in the full Hadoop/HBase client dependency tree
(~79 MB, 43 jars unique to it) but is only needed on secure (Kerberos)
clusters and is off by default. Ship only the README, consistent with
the other external/* connectors, and add bin/storm-autocreds-fetch to
retrieve the plugin and its runtime dependencies from Maven Central into
extlib-daemon on demand.

Also removes the now-unused storm-autocreds-bin assembly module.

The storm-kafka-monitor jars (and their Kafka client dependencies, ~38 MB)
are only needed to display Kafka spout lag in the UI or to run the
bin/storm-kafka-monitor command. Ship only the README, consistent with the
other external/* connectors, and add bin/storm-kafka-monitor-fetch to
retrieve the tool and its runtime dependencies from Maven Central into
lib-tools/storm-kafka-monitor on demand.

Guard the UI against the jars being absent: TopologySpoutLag now detects
whether storm-kafka-monitor is installed and, when it is not, surfaces an
actionable message (and logs it once) instead of failing the lag shell-out.
The bin/storm-kafka-monitor wrapper prints the same hint instead of a
ClassNotFound error.
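
The detect-then-hint behaviour described above could look roughly like the following. The real check lives in the Java class TopologySpoutLag, so this Python version only illustrates the logic; the hint text, the `spout_lag` result shape, and the function names are all assumptions.

```python
import logging
from pathlib import Path

LOG = logging.getLogger("spout_lag_sketch")
_warned = False  # module-level flag so the hint is logged only once

# Hypothetical wording; the actual message in the PR may differ.
HINT = ("storm-kafka-monitor is not installed; "
        "run bin/storm-kafka-monitor-fetch to install it")


def kafka_monitor_installed(storm_home: Path) -> bool:
    """True when lib-tools/storm-kafka-monitor holds at least one jar."""
    tool_dir = storm_home / "lib-tools" / "storm-kafka-monitor"
    return tool_dir.is_dir() and any(tool_dir.glob("*.jar"))


def spout_lag(storm_home: Path) -> dict:
    """Return a lag result, or an actionable message when the tool is absent."""
    global _warned
    if not kafka_monitor_installed(storm_home):
        if not _warned:
            LOG.warning(HINT)
            _warned = True
        return {"error": HINT}
    # The real code would shell out to the monitor here; omitted in this sketch.
    return {"error": None}
```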

Also removes the now-unused storm-kafka-monitor-bin assembly module.

Prepares de-duplication of the jars shared by the daemon (lib) and worker
(lib-worker) classpaths into a single lib-common directory. storm.py now
includes lib-common on both classpaths; when the directory is absent (older
layouts) it contributes nothing, so the change is backward compatible.

The worker classpath (lib-worker) is a byte-identical subset of the daemon
classpath (lib). dedup-libs.py moves the shared jars into a single lib-common
directory and removes the duplicate copies from lib, reclaiming ~71 MB. It only
de-duplicates byte-identical jars (same name and sha-256), so a version
mismatch is never silently merged; tool classpaths (lib-tools/*, lib-webapp)
are left untouched.

Paired with the lib-common classpath support in bin/storm.py. Wiring this into
the binary assembly is a follow-up (it must be validated by a full -Pdist
distribution build).

final-package now stages the daemon jars (copy-dependencies -> staging/lib) and
the shared jars (storm-client-bin's lib-worker -> staging/lib-common), runs
dedup-libs.py to remove the byte-identical copies from lib, and the assembly
packages the staged lib/ and lib-common/ directories. The storm-client-bin tree
(which only carried lib-worker) is no longer copied directly.

Verified through the prepare-package phase: staging/lib = 48 jars, staging/
lib-common = 40 jars, zero overlap, full 88-jar daemon set preserved across
lib + lib-common, ~71 MB reclaimed. The final tar/zip packaging requires a full
-Pdist (native) distribution build and must be validated in CI / on Linux.
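
A check like the one reported above (jar counts, zero overlap, full daemon set preserved) can be sketched in a few lines of Python. The helper names and the `expected` input, a set of jar names from the pre-dedup daemon classpath, are assumptions for illustration and are not part of the PR.

```python
from pathlib import Path


def jar_names(directory: Path) -> set[str]:
    """Collect the file names of all jars directly inside a directory."""
    return {p.name for p in directory.glob("*.jar")}


def check_staging(lib: Path, lib_common: Path, expected: set[str]) -> dict:
    """Summarise the staged layout and flag overlap or missing jars."""
    daemon_only = jar_names(lib)
    shared = jar_names(lib_common)
    return {
        "lib": len(daemon_only),
        "lib_common": len(shared),
        # Jars present in both directories; should be empty after dedup.
        "overlap": sorted(daemon_only & shared),
        # Jars from the original daemon set found in neither directory.
        "missing": sorted(expected - (daemon_only | shared)),
    }
```

For the numbers quoted in the commit message, a passing run would report 48 and 40 jars with empty `overlap` and `missing` lists.
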
The binary distribution now de-duplicates jars shared by the daemon and
worker classpaths into a lib-common directory, which bin/storm.py adds to
the classpath after the storm home wildcard. Update the expected classpath
assertions in test_storm_cli.py to include lib-common.
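
As a rough illustration of the classpath change, the sketch below adds lib-common right after the storm home wildcard and skips any directory that does not exist, which is what keeps older layouts working. It is not the code in bin/storm.py: the function name and the position of lib / lib-worker relative to lib-common are assumptions.

```python
import os
from pathlib import Path


def classpath(storm_home: Path, client: bool = False) -> str:
    """Build a wildcard classpath; absent directories contribute nothing."""
    entries = [str(storm_home / "*")]  # storm home wildcard comes first
    # Workers use lib-worker, daemons use lib; both also get lib-common.
    for name in ("lib-common", "lib-worker" if client else "lib"):
        directory = storm_home / name
        if directory.is_dir():
            entries.append(str(directory / "*"))
    return os.pathsep.join(entries)
```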
@rzo1 rzo1 force-pushed the reduce-distro-size branch from 7e37f99 to 15a9235 Compare July 1, 2026 11:31
@rzo1 rzo1 added this to the 3.0.0 milestone Jul 1, 2026
@reiabreu

reiabreu commented Jul 1, 2026 via email

Contributor

@rzo1

rzo1 commented Jul 1, 2026

Contributor Author

I know that this approach makes air-gapped deployments harder 😄 - we can also ship two variants: a light variant and a full version (as we currently do, but with de-duplicated deps) - but I am sick of downloading 0.5 GB 😄😂

@GGraziadei

Member

Hi,
Thanks for driving this effort and adding me into the discussion. I think this PR is a great step forward for cloud-native adoption; a ~400MB binary is definitely a friction point for containerized deployments, and reducing it by ~50% is a win, no discussion.

However, @reiabreu raises a crucial point regarding our enterprise downstream users. Maintaining a seamless out-of-the-box experience for air-gapped or offline production environments is paramount for stability.

While splitting this into two official distributions (a "full" and a "slim" artifact) seems like an easy path, it increases the project's maintenance and CI burden.
Furthermore, from a pragmatic standpoint, most risk-averse enterprise users will likely default to the "full" package just to be safe, completely missing out on this optimization and defeating the purpose of reducing release footprint.

To bridge the gap between keeping the distribution lightweight and supporting offline clusters, could we extend the fetch utility to support internal/custom Maven mirrors?

Most enterprise environments without internet access already run internal artifact repositories (e.g., Nexus, Artifactory) that mirror Maven Central. If the fetch scripts could respect a custom repository URL (configured via an environment variable like STORM_MAVEN_REPO_URL), those users could safely fetch these optional plugins from their internal network.
This would allow us to ship a single, lean binary without breaking enterprise workflows.

Thoughts?

@rzo1

rzo1 commented Jul 1, 2026

Contributor Author

Thanks @reiabreu and @GGraziadei - keeping the air-gapped/offline case working is exactly the goal.

The fetch scripts already support internal mirrors, because they delegate to Maven instead of hard-coding Maven Central, so Maven honors your existing mirror/proxy setup:

  • A host whose ~/.m2/settings.xml (or $MAVEN_HOME/conf/settings.xml) has a <mirror> with <mirrorOf>*</mirrorOf> pointing at your Nexus/Artifactory works with no extra flags.
  • Explicit file: bin/storm-kafka-monitor-fetch -- -s /etc/maven/settings.xml
  • Fully offline against a pre-seeded local repo: ... -- -Dmaven.repo.local=/srv/offline-repo -o
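
To make the `--` pass-through concrete, here is a minimal sketch of how a fetch wrapper might forward everything after `--` to Maven. How the actual fetch scripts invoke Maven is not shown in this thread, so the `dependency:copy-dependencies` goal, the `fetch-pom.xml` file name, and both function names are assumptions; only the `-s`, `-o`, and `-Dmaven.repo.local` options come from the examples above.

```python
def split_passthrough(argv: list[str]) -> tuple[list[str], list[str]]:
    """Split script arguments from Maven arguments at the first '--'."""
    if "--" in argv:
        cut = argv.index("--")
        return argv[:cut], argv[cut + 1:]
    return argv, []


def maven_command(pom: str, output_dir: str, argv: list[str]) -> list[str]:
    """Build (but do not run) a Maven command line for fetching runtime deps."""
    _, maven_args = split_passthrough(argv)
    return [
        "mvn", "-f", pom,
        "dependency:copy-dependencies",
        f"-DoutputDirectory={output_dir}",
        "-DincludeScope=runtime",
        *maven_args,  # e.g. -s <settings.xml>, -o, -Dmaven.repo.local=<dir>
    ]


if __name__ == "__main__":
    # Offline run against a pre-seeded local repository.
    print(maven_command(
        "fetch-pom.xml",  # hypothetical pom listing the plugin as a dependency
        "lib-tools/storm-kafka-monitor",
        ["--", "-Dmaven.repo.local=/srv/offline-repo", "-o"],
    ))
```

Because the extra arguments are appended untouched, anything Maven already understands (mirrors, proxies, credentials in settings.xml) applies without the wrapper knowing about it.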

I went with Maven resolution over a raw download precisely for this: enterprises already have mirror + proxy + auth in settings.xml, so we reuse all of it (credentials, mirrorOf patterns) instead of reinventing it.

The key point: fetch is a provisioning-time step, not a per-node runtime requirement - and it fits how operators already work. In hardened environments you don't want production nodes downloading anything; artifacts flow through a controlled prep/staging pipeline. That's the model here: run fetch once on an admin/build/CI box with mirror access, prep for the target environment, and copy the jars into extlib-daemon / lib-tools/... (or bake them into the image). The locked-down nodes stay as sealed as they are today.

autocreds is the clearest case: turning it on is already environment-specific prep - you stage the cluster's hdfs-site.xml/hbase-site.xml, keytabs, and wire the principals + plugin classes into storm.yaml. Copying the Hadoop client jars is just one more step in a task the operator is doing anyway, so bundling ~79 MB of Hadoop into every download to save that one copy is a poor trade.

And where a batteries-included artifact really is wanted, we can express that as Docker image variants rather than a second source release - e.g. a lean apache/storm:<ver> plus an apache/storm:<ver>-full (Hadoop/Kafka libs pre-fetched). Enterprise users who need the libs present just pull the full tag - no custom rebuild - while the source release stays single and lean. That moves the full-vs-slim split into our image CI (cheap, already automated) and off the Apache release process (no extra signing/vote/artifacts), and it sidesteps the "everyone grabs full" concern since the default/base image stays slim.

For boxes with no system Maven, mvnw works too - we can auto-detect one, its distribution coming from the same internal mirror.

@reiabreu

reiabreu commented Jul 2, 2026 via email

Contributor

@rzo1

rzo1 commented Jul 2, 2026

Contributor Author

Yep - I am also fine with producing two binaries. For Apache TomEE we produce 4 flavours (plume, plus, microprofile, webprofile), and honestly the maintenance overhead there is negligible - it's just additional assembly descriptors, the actual content is built once.

So concrete proposal for this PR:

  • Keep the lib-common de-duplication for the existing (full) distribution - that part is a pure win (-71 MB) with no impact on how anyone provisions Storm today. Your RPM use case keeps working unchanged: everything is still in the box.
  • Introduce a new flavour storm-lite as an additional binary artifact, which drops storm-autocreds and storm-kafka-monitor and relies on the fetch scripts instead. That's the one targeting containerized / cloud-native deployments.

The default download stays the full, batteries-included distribution, so nothing changes for air-gapped/offline users - storm-lite is opt-in for those who want the small footprint. That also addresses @GGraziadei's "everyone grabs full" concern from the other direction: full stays the safe default, lite is a conscious choice.

For 3.0.0-GA the only change to the release process would be one additional binary to test/verify during the vote. I think that's a fair trade. WDYT?

@jnioche

jnioche commented Jul 2, 2026

Contributor

Sounds good @rzo1
I assume there will be a new lite docker image as well?

4 participants