Skip to content

Improve SupportBundleService performance, stability, and test coverage#25555

Draft
todvora wants to merge 15 commits intomasterfrom
feature/support-bundle-tweaks
Draft

Improve SupportBundleService performance, stability, and test coverage#25555
todvora wants to merge 15 commits intomasterfrom
feature/support-bundle-tweaks

Conversation

@todvora
Copy link
Copy Markdown
Contributor

@todvora todvora commented Apr 3, 2026

The support bundle service had several performance and correctness issues that became significant at scale: sequential API fan-outs, thread pool starvation risk, memory pressure from buffering large log files, and no test coverage beyond a few unit cases.

/nocl yet

Description

Parallelism

Per-node data collection (thread dump, metrics, system stats, certificates) now runs as independent concurrent tasks instead of sequentially per node. Cluster-wide fan-out calls (requestOnAllNodes) are moved to a short-lived orchestration executor to avoid thread-pool starvation — previously wrapping them in supplyAsync on the shared executor could deadlock when the pool was exhausted by blocking fan-out wrappers waiting on sub-tasks submitted to the same pool.

Efficiency & correctness

Datanode log transfer switches from ResponseBody.bytes() (full heap buffer) to byteStream().transferTo() to avoid large allocations. Files.walk() in listBundles() is replaced with Files.list() since the directory is flat. closeEntry() in writeZipFile() is now guarded so it only runs if putNextEntry() succeeded, preventing a misleading second exception when zip entry creation fails. SimpleDateFormat (not thread-safe, per-call) is replaced with a static DateTimeFormatter. Several internal helpers are simplified or inlined, and public constants that have no external users are made private. Call timeout increased from 10 s to 30 s.

Motivation and Context

Fixes or at least improves https://github.com/Graylog2/graylog-plugin-enterprise/issues/13625

How Has This Been Tested?

  • Added SupportBundleResourceIT — full backend integration test covering all endpoints: manifest, logfile download, bundle build/list/download/delete, permission checks (403s), path traversal guards (404s), and zip content verification
  • Zip content test asserts cluster.json, per-node thread-dump.txt/metrics.json/server.mem.log, and datanode-specific datanodes//datanode.log and datanodes//certificates.json
  • Test class annotated with @EnabledIfSearchServer(distribution = DATANODE) — runs against a datanode backend

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactoring (non-breaking change)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have requested a documentation update.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.

@todvora todvora changed the title Feature/support bundle tweaks Support bundle tweaks Apr 3, 2026
@todvora todvora changed the title Support bundle tweaks Improve SupportBundleService performance, stability, and test coverage Apr 3, 2026
@todvora todvora requested a review from Copilot April 3, 2026 10:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the Support Bundle feature’s scalability by parallelizing node/cluster data collection, reducing memory pressure during log streaming, and adding full-backend integration tests to cover the end-to-end bundle lifecycle and permissions.

Changes:

  • Parallelized per-node and cluster-wide support bundle data collection via CompletableFuture to reduce latency at scale.
  • Reduced heap usage and improved streaming correctness when proxying datanode logs.
  • Added full-backend integration tests covering manifest/logfile endpoints and bundle build/list/download/delete flows (including security/path traversal checks).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
graylog2-server/src/main/java/org/graylog2/rest/resources/system/debug/bundle/SupportBundleService.java Refactors bundle collection to run more work concurrently; improves streaming and bundle listing implementation details.
full-backend-tests/src/test/java/org/graylog2/rest/resources/system/debug/bundle/SupportBundleResourceIT.java Adds end-to-end backend tests for support bundle APIs, permissions, and downloaded zip contents.
Comments suppressed due to low confidence (1)

graylog2-server/src/main/java/org/graylog2/rest/resources/system/debug/bundle/SupportBundleService.java:561

  • listBundles(): Files.list(bundleDir) may include directories; the previous implementation explicitly filtered them out. Without filtering, Files.size(f) can throw (and break listing) if any directory under bundleDir happens to start with the bundle prefix. Consider restoring a regular-file filter (e.g., !Files.isDirectory(p) / Files.isRegularFile(p)) before computing size.
    public List<BundleFile> listBundles() {
        try (var files = Files.list(bundleDir)) {
            return files
                    .filter(p -> p.getFileName().toString().startsWith(BUNDLE_NAME_PREFIX))
                    .map(f -> {
                        try {
                            return new BundleFile(f.getFileName().toString(), Files.size(f));
                        } catch (IOException e) {

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants