Reduce binary distribution size: unbundle optional Hadoop/Kafka deps + de-duplicate shared jars by rzo1 · Pull Request #8819 · apache/storm

rzo1 · 2026-06-30T18:16:36Z

What & why

The binary distribution ships ~395 MB of jars, much of it optional or duplicated. This PR trims it substantially without removing any capability: optional pieces become fetch-on-demand, and jars shared between the daemon and worker classpaths are de-duplicated.

A smaller artifact means less download, storage, and registry bandwidth, a smaller container image, and a lighter CI/CD carbon footprint. 🌱

Changes

1. storm-autocreds no longer bundled (-79 MB). It pulls the full Hadoop/HBase client tree but is only used on secure (Kerberos) clusters and is off by default. Now ships only the README (like the other external/* connectors); bin/storm-autocreds-fetch retrieves the plugin and its deps into extlib-daemon.
2. storm-kafka-monitor no longer bundled (-38 MB). Only needed to show Kafka spout lag in the UI or to run bin/storm-kafka-monitor. bin/storm-kafka-monitor-fetch installs it into lib-tools/storm-kafka-monitor. The UI degrades gracefully when it is absent (TopologySpoutLag detects it, shows an actionable message, logs once) and the wrapper prints a hint instead of ClassNotFound.
3. lib-common de-duplication (-71 MB). The worker classpath (lib-worker) is a byte-identical subset of the daemon classpath (lib). Shared jars are now kept once in lib-common/ (added to both classpaths via bin/storm.py) and removed from lib. dedup-libs.py only merges byte-identical jars (name and sha-256), so no version is silently merged; tool classpaths are left untouched.

Net: roughly -188 MB (~47%) of the bundled jar payload, with no loss of functionality.
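
As a quick sanity check on the net figure, the three per-item savings can be added up and compared with the ~395 MB starting point. This only restates the numbers quoted above; the variable names are illustrative.

```python
# Savings figures as stated in the PR description, in MB.
savings_mb = {
    "storm-autocreds": 79,
    "storm-kafka-monitor": 38,
    "lib-common de-duplication": 71,
}
total_before_mb = 395  # approximate bundled jar payload before this PR

total_saved_mb = sum(savings_mb.values())
saved_fraction = total_saved_mb / total_before_mb

print(f"saved {total_saved_mb} MB of {total_before_mb} MB ({saved_fraction:.1%})")
```

The sum matches the quoted -188 MB, and the ratio lands between 47% and 48%, in line with the "roughly ~47%" claim.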

Notes for reviewers

Opening as draft to get early feedback on the approach. Still need to verify on a Linux system (full -Pdist native distribution build).

storm-autocreds pulls in the full Hadoop/HBase client dependency tree (~79 MB, 43 jars unique to it) but is only needed on secure (Kerberos) clusters and is off by default. Ship only the README, consistent with the other external/* connectors, and add bin/storm-autocreds-fetch to retrieve the plugin and its runtime dependencies from Maven Central into extlib-daemon on demand. Also removes the now-unused storm-autocreds-bin assembly module.

The storm-kafka-monitor jars (and their Kafka client dependencies, ~38 MB) are only needed to display Kafka spout lag in the UI or to run the bin/storm-kafka-monitor command. Ship only the README, consistent with the other external/* connectors, and add bin/storm-kafka-monitor-fetch to retrieve the tool and its runtime dependencies from Maven Central into lib-tools/storm-kafka-monitor on demand. Guard the UI against the jars being absent: TopologySpoutLag now detects whether storm-kafka-monitor is installed and, when it is not, surfaces an actionable message (and logs it once) instead of failing the lag shell-out. The bin/storm-kafka-monitor wrapper prints the same hint instead of a ClassNotFound error. Also removes the now-unused storm-kafka-monitor-bin assembly module.

Prepares de-duplication of the jars shared by the daemon (lib) and worker (lib-worker) classpaths into a single lib-common directory. storm.py now includes lib-common on both classpaths; when the directory is absent (older layouts) it contributes nothing, so the change is backward compatible.
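
A minimal sketch of the backward-compatible behaviour described above, assuming an explicit directory check. The helper name `classpath_entries` and its signature are hypothetical and are not taken from bin/storm.py; the real script may achieve the same effect differently.

```python
import os


def classpath_entries(storm_home, lib_dir_name):
    """Build classpath entries for one process type (hypothetical helper).

    lib_dir_name is "lib" for daemons or "lib-worker" for workers.
    lib-common is placed right after the storm home wildcard and is
    skipped entirely when the directory does not exist (older layouts).
    """
    entries = [os.path.join(storm_home, "*")]
    common = os.path.join(storm_home, "lib-common")
    if os.path.isdir(common):
        entries.append(os.path.join(common, "*"))
    entries.append(os.path.join(storm_home, lib_dir_name, "*"))
    return entries
```

With an older layout that has no lib-common directory, the returned list is the same as it would have been before the change, which is the compatibility property the commit relies on.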

The worker classpath (lib-worker) is a byte-identical subset of the daemon classpath (lib). dedup-libs.py moves the shared jars into a single lib-common directory and removes the duplicate copies from lib, reclaiming ~71 MB. It only de-duplicates byte-identical jars (same name and sha-256), so a version mismatch is never silently merged; tool classpaths (lib-tools/*, lib-webapp) are left untouched. Paired with the lib-common classpath support in bin/storm.py. Wiring this into the binary assembly is a follow-up (it must be validated by a full -Pdist distribution build).
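
The "same name and sha-256" rule can be sketched as follows. This is an illustration of the rule only, not the contents of dedup-libs.py: the function names are hypothetical, and it assumes the shared jars have already been placed in the common directory.

```python
import hashlib
import os


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_lib(lib_dir, common_dir):
    """Delete jars in lib_dir that duplicate a jar in common_dir.

    A jar is removed only when a jar with the same file name exists in
    common_dir AND both files have the same SHA-256 digest. Returns the
    sorted list of removed file names.
    """
    removed = []
    for name in sorted(os.listdir(common_dir)):
        if not name.endswith(".jar"):
            continue
        candidate = os.path.join(lib_dir, name)
        if not os.path.isfile(candidate):
            continue
        if sha256_of(candidate) == sha256_of(os.path.join(common_dir, name)):
            os.remove(candidate)
            removed.append(name)
    return removed
```

A jar whose name matches but whose digest differs is left in place in both directories, so a version or content mismatch stays visible instead of being merged away.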

final-package now stages the daemon jars (copy-dependencies -> staging/lib) and the shared jars (storm-client-bin's lib-worker -> staging/lib-common), runs dedup-libs.py to remove the byte-identical copies from lib, and the assembly packages the staged lib/ and lib-common/ directories. The storm-client-bin tree (which only carried lib-worker) is no longer copied directly. Verified through the prepare-package phase: staging/lib = 48 jars, staging/lib-common = 40 jars, zero overlap, full 88-jar daemon set preserved across lib + lib-common, ~71 MB reclaimed. The final tar/zip packaging requires a full -Pdist (native) distribution build and must be validated in CI / on Linux.
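
The invariants quoted above (jar counts, zero overlap, full daemon set preserved) could be checked with something like the sketch below. The helper `check_staging` and the idea of passing in an expected jar list are assumptions for illustration; the PR does not say how the numbers were obtained.

```python
import os


def jar_names(directory):
    """Return the set of *.jar file names directly inside a directory."""
    return {n for n in os.listdir(directory) if n.endswith(".jar")}


def check_staging(lib_dir, common_dir, expected_daemon_jars):
    """Summarise the staged layout (hypothetical helper).

    overlap: names present in both directories (should be empty).
    missing: expected daemon jars found in neither directory
             (should be empty).
    """
    lib = jar_names(lib_dir)
    common = jar_names(common_dir)
    return {
        "lib": len(lib),
        "lib-common": len(common),
        "overlap": sorted(lib & common),
        "missing": sorted(set(expected_daemon_jars) - (lib | common)),
    }
```

For the run described above, the expectation would be lib = 48, lib-common = 40, and empty overlap and missing lists when given the original 88 daemon jar names.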

The binary distribution now de-duplicates jars shared by the daemon and worker classpaths into a lib-common directory, which bin/storm.py adds to the classpath after the storm home wildcard. Update the expected classpath assertions in test_storm_cli.py to include lib-common.

reiabreu · 2026-07-01T15:59:04Z

Hi Richard! I will have a closer look ASAP. But a quick note just looking at the description: a release will not have the storm-kafka-monitor out of the box; it will be pulled as needed at runtime. This is conditional on how organizations provision their systems and the type of connectivity those boxes have to the internet. A full-fledged binary distribution is the only way we can cover all cases, even if that means being overprovisioned.

Commit Summary
- 8cf5dd8 build: stop bundling storm-autocreds in the binary distribution
- 696dbc7 build: stop bundling storm-kafka-monitor in the binary distribution
- 3aeac63 build: add lib-common to the daemon and worker classpaths
- 4bd2fda build: add dedup-libs script to share daemon/worker jars via lib-common
- 97d2922 build: wire lib-common de-duplication into the binary assembly

File Changes (16 files <https://github.com/apache/storm/pull/8819/files>)
- A bin/storm-autocreds-fetch (134)
- M bin/storm-kafka-monitor (9)
- A bin/storm-kafka-monitor-fetch (133)
- M bin/storm.py (11)
- M bin/test_storm.py (13)
- A external/storm-autocreds/README.md (101)
- M external/storm-kafka-monitor/README.md (24)
- M storm-core/src/jvm/org/apache/storm/utils/TopologySpoutLag.java (36)
- M storm-dist/binary/final-package/pom.xml (67)
- M storm-dist/binary/final-package/src/main/assembly/binary.xml (46)
- A storm-dist/binary/final-package/src/main/scripts/dedup-libs.py (104)
- M storm-dist/binary/pom.xml (2)
- D storm-dist/binary/storm-autocreds-bin/pom.xml (63)
- D storm-dist/binary/storm-autocreds-bin/src/main/assembly/storm-autocreds.xml (33)
- D storm-dist/binary/storm-kafka-monitor-bin/pom.xml (69)
- D storm-dist/binary/storm-kafka-monitor-bin/src/main/assembly/storm-kafka-monitor.xml (33)

Patch Links:
- https://github.com/apache/storm/pull/8819.patch
- https://github.com/apache/storm/pull/8819.diff

rzo1 · 2026-07-01T16:23:22Z

I know that this approach makes deployment harder for air-gapped environments 😄 - we can also ship two variants: a light variant and a full version (as we currently do, but with deduplicated deps) - but I am sick of downloading 0.5 GB 😄😂

GGraziadei · 2026-07-01T16:37:16Z

Hi,
Thanks for driving this effort and adding me to the discussion. I think this PR is a great step forward for cloud-native adoption; a ~400 MB binary is definitely a friction point for containerized deployments, and reducing it by ~50% is a clear win.

However, @reiabreu raises a crucial point regarding our enterprise downstream users. Maintaining a seamless out-of-the-box experience for air-gapped or offline production environments is paramount for stability.

While splitting this into two official distributions (a "full" and a "slim" artifact) seems like an easy path, it increases the project's maintenance and CI burden.
Furthermore, from a pragmatic standpoint, most risk-averse enterprise users will likely default to the "full" package just to be safe, completely missing out on this optimization and defeating the purpose of reducing release footprint.

To bridge the gap between keeping the distribution lightweight and supporting offline clusters, could we extend the fetch utility to support internal/custom Maven mirrors?

Most enterprise environments without internet access already run internal artifact repositories (e.g., Nexus, Artifactory) that mirror Maven Central. If the fetch scripts could respect a custom repository URL (configured via an environment variable like STORM_MAVEN_REPO_URL), those users could safely fetch these optional plugins from their internal network.
This would allow us to ship a single, lean binary without breaking enterprise workflows.

Thoughts?

rzo1 · 2026-07-01T17:51:16Z

Thanks @reiabreu and @GGraziadei - keeping the air-gapped/offline case working is exactly the goal.

The fetch scripts already support internal mirrors, because they delegate to Maven instead of hard-coding Maven Central, so Maven honors your existing mirror/proxy setup:

- A host whose ~/.m2/settings.xml (or $MAVEN_HOME/conf/settings.xml) has a <mirror> with <mirrorOf>*</mirrorOf> pointing at your Nexus/Artifactory works with no extra flags.
- Explicit file: bin/storm-kafka-monitor-fetch -- -s /etc/maven/settings.xml
- Fully offline against a pre-seeded local repo: ... -- -Dmaven.repo.local=/srv/offline-repo -o
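
To illustrate the pass-through pattern in the examples above (everything after `--` goes to Maven untouched), here is a rough sketch. It is not the fetch script itself: the function names, the `X.Y.Z` version placeholder, and the choice of the `dependency:copy` goal are assumptions. `dependency:copy` copies a single artifact, so a real script would need additional steps to pull the runtime dependencies as well. The `-s`, `-o`, and `-Dmaven.repo.local` options are standard Maven options.

```python
import shlex

MVN = "mvn"  # assumption: a Maven launcher is available on PATH


def split_passthrough(argv):
    """Split argv at the first "--" into (script_args, maven_args)."""
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1:]
    return argv, []


def build_maven_command(artifact, output_dir, maven_args):
    """Build a Maven command line, appending the user's Maven arguments.

    Because resolution is delegated to Maven, mirrors, proxies and
    credentials from settings.xml apply without extra handling here.
    """
    return [
        MVN,
        "dependency:copy",
        f"-Dartifact={artifact}",
        f"-DoutputDirectory={output_dir}",
        *maven_args,
    ]


# Mirrors the "explicit settings file" example from the list above.
own_args, maven_args = split_passthrough(["--", "-s", "/etc/maven/settings.xml"])
cmd = build_maven_command(
    "org.apache.storm:storm-kafka-monitor:X.Y.Z",  # X.Y.Z is a placeholder
    "lib-tools/storm-kafka-monitor",
    maven_args,
)
print(shlex.join(cmd))
```

The sketch only builds and prints the command; nothing is executed or downloaded.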

I went with Maven resolution over a raw download precisely for this: enterprises already have mirror + proxy + auth in settings.xml, so we reuse all of it (credentials, mirrorOf patterns) instead of reinventing it.

The key point: fetch is a provisioning-time step, not a per-node runtime requirement - and it fits how operators already work. In hardened environments you don't want production nodes downloading anything; artifacts flow through a controlled prep/staging pipeline. That's the model here: run fetch once on an admin/build/CI box with mirror access, prep for the target environment, and copy the jars into extlib-daemon / lib-tools/... (or bake them into the image). The locked-down nodes stay as sealed as they are today.

autocreds is the clearest case: turning it on is already environment-specific prep - you stage the cluster's hdfs-site.xml/hbase-site.xml, keytabs, and wire the principals + plugin classes into storm.yaml. Copying the Hadoop client jars is just one more step in a task the operator is doing anyway, so bundling ~79 MB of Hadoop into every download to save that one copy is a poor trade.

And where a batteries-included artifact really is wanted, we can express that as Docker image variants rather than a second source release - e.g. a lean apache/storm:<ver> plus an apache/storm:<ver>-full (Hadoop/Kafka libs pre-fetched). Enterprise users who need the libs present just pull the full tag - no custom rebuild - while the source release stays single and lean. That moves the full-vs-slim split into our image CI (cheap, already automated) and off the Apache release process (no extra signing/vote/artifacts), and it sidesteps the "everyone grabs full" concern since the default/base image stays slim.

For boxes with no system Maven, mvnw works too - we can auto-detect one, its distribution coming from the same internal mirror.

reiabreu · 2026-07-02T11:36:53Z

Your arguments are valid, but I think we should let end users decide how they want to package Storm. We could offer both the full-fledged distribution and the reduced-size one, as you mentioned. Full disclosure: I use Storm by packing the full binaries into an RPM, and while it might not be optimal, I don't have to worry about special use cases for whoever uses the RPM. It is all available. I will check the lib-common de-duplication part ASAP, but it seems like something we want.


rzo1 · 2026-07-02T11:48:21Z

Yep - I am also fine with producing two binaries. For Apache TomEE we produce 4 flavours (plume, plus, microprofile, webprofile), and honestly the maintenance overhead there is negligible - it's just additional assembly descriptors, the actual content is built once.

So concrete proposal for this PR:

1. Keep the lib-common de-duplication for the existing (full) distribution - that part is a pure win (-71 MB) with no impact on how anyone provisions Storm today. Your RPM use case keeps working unchanged: everything is still in the box.
2. Introduce a new flavour storm-lite as an additional binary artifact, which drops storm-autocreds and storm-kafka-monitor and relies on the fetch scripts instead. That's the one targeting containerized / cloud-native deployments.

The default download stays the full, batteries-included distribution, so nothing changes for air-gapped/offline users - storm-lite is opt-in for those who want the small footprint. That also addresses @GGraziadei's "everyone grabs full" concern from the other direction: full stays the safe default, lite is a conscious choice.

For 3.0.0-GA the only change to the release process would be one additional binary to test/verify during the vote. I think that's a fair trade. WDYT?

jnioche · 2026-07-02T11:53:10Z

Sounds good @rzo1
I assume there will be a new lite Docker image as well?

rzo1 requested review from GGraziadei, jnioche and reiabreu June 30, 2026 18:16

rzo1 added 6 commits July 1, 2026 13:31

rzo1 force-pushed the reduce-distro-size branch from 7e37f99 to 15a9235 Compare July 1, 2026 11:31

rzo1 added this to the 3.0.0 milestone Jul 1, 2026

rzo1 wants to merge 6 commits into master from reduce-distro-size

4 participants
