Reduce binary distribution size: unbundle optional Hadoop/Kafka deps + de-duplicate shared jars #8819
rzo1 wants to merge 6 commits into
storm-autocreds pulls in the full Hadoop/HBase client dependency tree (~79 MB, 43 jars unique to it) but is only needed on secure (Kerberos) clusters and is off by default. Ship only the README, consistent with the other external/* connectors, and add bin/storm-autocreds-fetch to retrieve the plugin and its runtime dependencies from Maven Central into extlib-daemon on demand. Also removes the now-unused storm-autocreds-bin assembly module.
The storm-kafka-monitor jars (and their Kafka client dependencies, ~38 MB) are only needed to display Kafka spout lag in the UI or to run the bin/storm-kafka-monitor command. Ship only the README, consistent with the other external/* connectors, and add bin/storm-kafka-monitor-fetch to retrieve the tool and its runtime dependencies from Maven Central into lib-tools/storm-kafka-monitor on demand. Guard the UI against the jars being absent: TopologySpoutLag now detects whether storm-kafka-monitor is installed and, when it is not, surfaces an actionable message (and logs it once) instead of failing the lag shell-out. The bin/storm-kafka-monitor wrapper prints the same hint instead of a ClassNotFound error. Also removes the now-unused storm-kafka-monitor-bin assembly module.
Prepares de-duplication of the jars shared by the daemon (lib) and worker (lib-worker) classpaths into a single lib-common directory. storm.py now includes lib-common on both classpaths; when the directory is absent (older layouts) it contributes nothing, so the change is backward compatible.
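The backward-compatible classpath handling described above could look roughly like the following. This is a minimal sketch, not the actual bin/storm.py code: the function names `jar_dirs_for` and `classpath_for` and the `role` argument are hypothetical, and only the directory names (lib, lib-worker, lib-common) come from the commit message.

```python
import os


def jar_dirs_for(storm_home, role):
    """Return the existing jar directories for a 'daemon' or 'worker' classpath.

    Hypothetical helper: a directory that does not exist (for example
    lib-common in an older layout) simply contributes nothing.
    """
    primary = "lib" if role == "daemon" else "lib-worker"
    candidates = [
        os.path.join(storm_home, primary),
        os.path.join(storm_home, "lib-common"),
    ]
    return [d for d in candidates if os.path.isdir(d)]


def classpath_for(storm_home, role):
    """Join the jar directories into a wildcard classpath string."""
    return os.pathsep.join(
        os.path.join(d, "*") for d in jar_dirs_for(storm_home, role)
    )
```

Filtering on `os.path.isdir` is what keeps older layouts working in this sketch: without lib-common the result is the same classpath as before.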
The worker classpath (lib-worker) is a byte-identical subset of the daemon classpath (lib). dedup-libs.py moves the shared jars into a single lib-common directory and removes the duplicate copies from lib, reclaiming ~71 MB. It only de-duplicates byte-identical jars (same name and sha-256), so a version mismatch is never silently merged; tool classpaths (lib-tools/*, lib-webapp) are left untouched. Paired with the lib-common classpath support in bin/storm.py. Wiring this into the binary assembly is a follow-up (it must be validated by a full -Pdist distribution build).
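As an illustration of the "same name and sha-256" rule, here is a minimal sketch of such a de-duplication pass. It is not the dedup-libs.py from this PR: the function names and the three-directory signature are assumptions made for the example.

```python
import hashlib
import os
import shutil


def sha256_of(path):
    """Hash a file in chunks so large jars are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_shared_jars(lib_dir, worker_dir, common_dir):
    """Move jars that are byte-identical in lib_dir and worker_dir to common_dir.

    Hypothetical sketch: a jar is shared only if the file name AND the
    sha-256 match. On a mismatch both copies stay where they are.
    Returns the sorted list of jar names that were moved.
    """
    os.makedirs(common_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(worker_dir)):
        if not name.endswith(".jar"):
            continue
        daemon_jar = os.path.join(lib_dir, name)
        worker_jar = os.path.join(worker_dir, name)
        if not os.path.isfile(daemon_jar):
            continue  # jar exists only on the worker side
        if sha256_of(daemon_jar) != sha256_of(worker_jar):
            continue  # same name, different bytes: never merge silently
        shutil.move(worker_jar, os.path.join(common_dir, name))
        os.remove(daemon_jar)  # drop the duplicate copy from lib
        moved.append(name)
    return moved
```

Comparing hashes rather than names alone is the design point: two jars with the same file name but different content are left untouched on both sides.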
final-package now stages the daemon jars (copy-dependencies -> staging/lib) and the shared jars (storm-client-bin's lib-worker -> staging/lib-common), runs dedup-libs.py to remove the byte-identical copies from lib, and the assembly packages the staged lib/ and lib-common/ directories. The storm-client-bin tree (which only carried lib-worker) is no longer copied directly. Verified through the prepare-package phase: staging/lib = 48 jars, staging/lib-common = 40 jars, zero overlap, full 88-jar daemon set preserved across lib + lib-common, ~71 MB reclaimed. The final tar/zip packaging requires a full -Pdist (native) distribution build and must be validated in CI / on Linux.
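The invariants quoted above (zero overlap, full daemon set preserved) can be expressed as a small check. This is a sketch under assumptions: `check_staging` is a hypothetical helper, and the jar names in the usage lines are placeholders, not the real staged jars.

```python
def check_staging(lib_jars, common_jars, expected_jars):
    """Report overlap between lib and lib-common and any missing daemon jars.

    Hypothetical helper: all three arguments are iterables of jar file names.
    """
    lib_set, common_set = set(lib_jars), set(common_jars)
    combined = lib_set | common_set
    return {
        "overlap": sorted(lib_set & common_set),
        "missing": sorted(set(expected_jars) - combined),
        "total": len(combined),
    }


# Placeholder names that only reproduce the reported counts (48 + 40 = 88).
lib = [f"jar-{i}.jar" for i in range(48)]
common = [f"jar-{i}.jar" for i in range(48, 88)]
report = check_staging(lib, common, lib + common)
```

With these placeholder inputs `report` has an empty overlap, nothing missing, and a total of 88, matching the figures in the commit message.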
The binary distribution now de-duplicates jars shared by the daemon and worker classpaths into a lib-common directory, which bin/storm.py adds to the classpath after the storm home wildcard. Update the expected classpath assertions in test_storm_cli.py to include lib-common.
|
Hi Richard! I will have a closer look ASAP. But a quick note just looking
at the description: a release will not have the storm-kafka-monitor out of
the box; it will be pulled as needed at runtime. This is conditional on how
organizations provision their systems and the type of connectivity those
boxes have to the internet. A full-fledged binary distribution is the only
way we can cover all cases, even if that means being overprovisioned.
…On Tue, 30 Jun 2026 at 19:16, Richard Zowalla ***@***.***> wrote:
What & why
The binary distribution ships ~395 MB of jars, much of it optional or
duplicated. This PR trims it substantially without removing any capability:
optional pieces become fetch-on-demand, and jars shared between the daemon
and worker classpaths are de-duplicated.
Smaller artifact means less download/storage/registry bandwidth and a
smaller container image, and a lighter CI/CD carbon footprint. 🌱
Changes
1. *storm-autocreds no longer bundled (-79 MB).* It pulls the full
Hadoop/HBase client tree but is only used on secure (Kerberos) clusters and
is off by default. Now ships only the README (like the other external/*
connectors); bin/storm-autocreds-fetch retrieves the plugin and its
deps into extlib-daemon.
2. *storm-kafka-monitor no longer bundled (-38 MB).* Only needed to show
Kafka spout lag in the UI or to run bin/storm-kafka-monitor.
bin/storm-kafka-monitor-fetch installs it into
lib-tools/storm-kafka-monitor. The UI degrades gracefully when it is
absent (TopologySpoutLag detects it, shows an actionable message, logs
once) and the wrapper prints a hint instead of ClassNotFound.
3. *lib-common de-duplication (-71 MB).* The worker classpath (lib-worker)
is a byte-identical subset of the daemon classpath (lib). Shared jars
are now kept once in lib-common/ (added to both classpaths via
bin/storm.py) and removed from lib. dedup-libs.py only merges
byte-identical jars (name and sha-256), so no version is silently merged;
tool classpaths are left untouched.
Net: roughly -188 MB (~47%) of the bundled jar payload, with no loss of
functionality.
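The net figure follows directly from the three per-item savings. A short worked calculation, using only the rounded numbers quoted in this description (the variable names are illustrative):

```python
BUNDLED_MB = 395  # approximate jar payload of the current distribution

savings_mb = {
    "storm-autocreds": 79,
    "storm-kafka-monitor": 38,
    "lib-common de-duplication": 71,
}

total_saved_mb = sum(savings_mb.values())  # 79 + 38 + 71 = 188
share = total_saved_mb / BUNDLED_MB        # 188 / 395, about 0.476

print(f"saved ~{total_saved_mb} MB ({share:.1%} of ~{BUNDLED_MB} MB)")
```

Since the inputs are themselves approximate, the result is consistent with the "roughly -188 MB (~47%)" stated above.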
Notes for reviewers
Opening as draft to get early feedback on the approach. Still need to
verify on a Linux system (full -Pdist native distribution build).
Commit Summary
- 8cf5dd8 build: stop bundling storm-autocreds in the binary distribution
- 696dbc7 build: stop bundling storm-kafka-monitor in the binary distribution
- 3aeac63 build: add lib-common to the daemon and worker classpaths
- 4bd2fda build: add dedup-libs script to share daemon/worker jars via lib-common
- 97d2922 build: wire lib-common de-duplication into the binary assembly
File Changes
(16 files <https://github.com/apache/storm/pull/8819/files>)
- *A* bin/storm-autocreds-fetch (134)
- *M* bin/storm-kafka-monitor (9)
- *A* bin/storm-kafka-monitor-fetch (133)
- *M* bin/storm.py (11)
- *M* bin/test_storm.py (13)
- *A* external/storm-autocreds/README.md (101)
- *M* external/storm-kafka-monitor/README.md (24)
- *M* storm-core/src/jvm/org/apache/storm/utils/TopologySpoutLag.java (36)
- *M* storm-dist/binary/final-package/pom.xml (67)
- *M* storm-dist/binary/final-package/src/main/assembly/binary.xml (46)
- *A* storm-dist/binary/final-package/src/main/scripts/dedup-libs.py (104)
- *M* storm-dist/binary/pom.xml (2)
- *D* storm-dist/binary/storm-autocreds-bin/pom.xml (63)
- *D* storm-dist/binary/storm-autocreds-bin/src/main/assembly/storm-autocreds.xml (33)
- *D* storm-dist/binary/storm-kafka-monitor-bin/pom.xml (69)
- *D* storm-dist/binary/storm-kafka-monitor-bin/src/main/assembly/storm-kafka-monitor.xml (33)
Patch Links:
- https://github.com/apache/storm/pull/8819.patch
- https://github.com/apache/storm/pull/8819.diff
|
|
I know that this approach makes deployment harder for air-gapped deployments 😄 - we can also ship two variants: a light variant and a full version (as we currently do, but with de-duplicated deps) - but I am sick of downloading 0.5 GB 😄😂 |
|
Hi, However, @reiabreu raises a crucial point regarding our enterprise downstream users. Maintaining a seamless out-of-the-box experience for air-gapped or offline production environments is paramount for stability. While splitting this into two official distributions (a "full" and a "slim" artifact) seems like an easy path, it increases the project's maintenance and CI burden. To bridge the gap between keeping the distribution lightweight and supporting offline clusters, could we extend the fetch utility to support internal/custom Maven mirrors? Most enterprise environments without internet access already run internal artifact repositories (e.g., Nexus, Artifactory) that mirror Maven Central. If the fetch scripts could respect a custom repository URL (configured via an environment variable like Thoughts? |
|
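Thanks @reiabreu and @GGraziadei - keeping the air-gapped/offline case working is exactly the goal.
The fetch scripts already support internal mirrors, because they delegate to Maven instead of hard-coding Maven Central, so Maven honors your existing mirror/proxy setup:
- A host whose ~/.m2/settings.xml (or $MAVEN_HOME/conf/settings.xml) has a <mirror> with <mirrorOf>*</mirrorOf> pointing at your Nexus/Artifactory works with no extra flags.
- Explicit file: bin/storm-kafka-monitor-fetch -- -s /etc/maven/settings.xml
- Fully offline against a pre-seeded local repo: ... -- -Dmaven.repo.local=/srv/offline-repo -o
I went with Maven resolution over a raw download precisely for this: enterprises already have mirror + proxy + auth in settings.xml, so we reuse all of it (credentials, mirrorOf patterns) instead of reinventing it.
The key point: fetch is a provisioning-time step, not a per-node runtime requirement - and it fits how operators already work. In hardened environments you don't want production nodes downloading anything; artifacts flow through a controlled prep/staging pipeline. That's the model here: run fetch once on an admin/build/CI box with mirror access, prep for the target environment, and copy the jars into extlib-daemon / lib-tools/... (or bake them into the image). The locked-down nodes stay as sealed as they are today.
autocreds is the clearest case: turning it on is already environment-specific prep - you stage the cluster's hdfs-site.xml/hbase-site.xml, keytabs, and wire the principals + plugin classes into storm.yaml. Copying the Hadoop client jars is just one more step in a task the operator is doing anyway, so bundling ~79 MB of Hadoop into every download to save that one copy is a poor trade.
And where a batteries-included artifact really is wanted, we can express that as Docker image variants rather than a second source release - e.g. a lean apache/storm:<ver> plus an apache/storm:<ver>-full (Hadoop/Kafka libs pre-fetched). Enterprise users who need the libs present just pull the full tag - no custom rebuild - while the source release stays single and lean. That moves the full-vs-slim split into our image CI (cheap, already automated) and off the Apache release process (no extra signing/vote/artifacts), and it sidesteps the "everyone grabs full" concern since the default/base image stays slim.
For boxes with no system Maven, mvnw works too - we can auto-detect one, its distribution coming from the same internal mirror. |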
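To make the Maven delegation concrete: everything after a bare `--` on the fetch script's command line is handed to Maven unchanged, which is how flags such as `-s`, `-o` and `-Dmaven.repo.local` reach it. The following is a minimal sketch of that pass-through, not the actual fetch script: `split_passthrough` and `maven_command` are hypothetical names, and the `dependency:get` goal with a placeholder version is only an assumed example of a base invocation.

```python
def split_passthrough(argv):
    """Split argv at the first '--' into (script_args, maven_args)."""
    if "--" in argv:
        index = argv.index("--")
        return argv[:index], argv[index + 1:]
    return list(argv), []


def maven_command(base_goal_args, argv, mvn="mvn"):
    """Build the Maven command line: base goal first, user flags last.

    Hypothetical helper: mirror, proxy and credentials are never handled
    here, they come from whatever settings.xml Maven itself resolves.
    """
    _, maven_args = split_passthrough(argv)
    return [mvn, *base_goal_args, *maven_args]


# Assumed base invocation; the version is deliberately left as a placeholder.
base = ["dependency:get", "-Dartifact=org.apache.storm:storm-kafka-monitor:<version>"]
explicit_settings = maven_command(base, ["--", "-s", "/etc/maven/settings.xml"])
offline = maven_command(base, ["--", "-Dmaven.repo.local=/srv/offline-repo", "-o"])
```

Because the script only forwards flags in this sketch, it needs no mirror-specific option of its own.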
|
Your arguments are valid, but I think we should let end users decide how
they want to package Storm. We could offer both the full-fledged
distribution and the reduced-size one, as you mentioned.
Full disclosure: I use Storm by packing the full binaries into an RPM, and
while it might not be optimal, I don't have to worry about special use
cases for whoever uses the RPM. It is all available.
I will check the *lib-common de-duplication* part ASAP, but it seems like
something we want.
…On Wed, 1 Jul 2026 at 18:51, Richard Zowalla ***@***.***> wrote:
*rzo1* left a comment (apache/storm#8819)
<#8819 (comment)>
Thanks @reiabreu <https://github.com/reiabreu> and @GGraziadei
<https://github.com/GGraziadei> - keeping the air-gapped/offline case
working is exactly the goal.
*The fetch scripts already support internal mirrors*, because they
delegate to Maven instead of hard-coding Maven Central, so Maven honors
your existing mirror/proxy setup:
- A host whose ~/.m2/settings.xml (or $MAVEN_HOME/conf/settings.xml)
has a <mirror> with <mirrorOf>*</mirrorOf> pointing at your
Nexus/Artifactory works with *no extra flags*.
- Explicit file: bin/storm-kafka-monitor-fetch -- -s
/etc/maven/settings.xml
- Fully offline against a pre-seeded local repo: ... --
-Dmaven.repo.local=/srv/offline-repo -o
I went with Maven resolution over a raw download precisely for this:
enterprises already have mirror + proxy + auth in settings.xml, so we reuse
all of it (credentials, mirrorOf patterns) instead of reinventing it.
*The key point: fetch is a provisioning-time step, not a per-node runtime
requirement - and it fits how operators already work.* In hardened
environments you *don't* want production nodes downloading anything;
artifacts flow through a controlled prep/staging pipeline. That's the model
here: run fetch once on an admin/build/CI box with mirror access, prep for
the target environment, and copy the jars into extlib-daemon /
lib-tools/... (or bake them into the image). The locked-down nodes stay
as sealed as they are today.
autocreds is the clearest case: turning it on is *already*
environment-specific prep - you stage the cluster's hdfs-site.xml/
hbase-site.xml, keytabs, and wire the principals + plugin classes into
storm.yaml. Copying the Hadoop client jars is just one more step in a
task the operator is doing anyway, so bundling ~79 MB of Hadoop into
*every* download to save that one copy is a poor trade.
*And where a batteries-included artifact really is wanted, we can express
that as Docker image variants rather than a second source release* - e.g.
a lean apache/storm:<ver> plus a apache/storm:<ver>-full (Hadoop/Kafka
libs pre-fetched). Enterprise users who need the libs present just pull the
full tag - no custom rebuild - while the source release stays single and
lean. That moves the full-vs-slim split into our image CI (cheap, already
automated) and off the Apache release process (no extra
signing/vote/artifacts), and it sidesteps the "everyone grabs full" concern
since the default/base image stays slim.
For boxes with no system Maven, mvnw works too - we can auto-detect one,
its distribution coming from the same internal mirror.
|
|
Yep - I am also fine with producing two binaries. For Apache TomEE we produce 4 flavours (plume, plus, microprofile, webprofile), and honestly the maintenance overhead there is negligible - it's just additional assembly descriptors, the actual content is built once. So concrete proposal for this PR:
The default download stays the full, batteries-included distribution, so nothing changes for air-gapped/offline users - For 3.0.0-GA the only change to the release process would be one additional binary to test/verify during the vote. I think that's a fair trade. WDYT? |
|
Sounds good @rzo1 |