
feat: get_ranges 2x performance degradation #7380

Open
comphead wants to merge 13 commits into apache:main from comphead:main

Conversation

@comphead
Contributor

@comphead comphead commented Apr 12, 2026

Which issue does this PR close?

Closes #7383

In Apache DataFusion Comet we found that HDFS access slowed down dramatically, degrading HDFS task performance by 2x, likely after #7192

Original issue apache/datafusion-comet#3926

Rationale for this change

Summary

  • Concurrent range reads in get_ranges: Replace the sequential per-range read loop with concurrent reads using futures::stream::buffered(8), improving wall-clock latency for multi-range reads (e.g., Parquet rowgroups from HDFS)
  • No stat() call: The existing get_ranges already avoids stat(); this preserves that by using a single reader created once per call
  • Exact-sized fetches: Each range fetches only the requested bytes — no coalescing/over-fetching that would retain large backing buffers in memory
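The concurrent per-range approach described above can be sketched in std-only Rust. The PR itself uses `futures::stream::buffered(8)` over a single OpenDAL reader; scoped threads stand in for the async machinery here so the sketch is self-contained, and `read_range`/`get_ranges` are hypothetical names, not the PR's actual functions.

```rust
use std::ops::Range;
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for a remote file: each range read returns exactly
// the requested bytes (no coalescing, no over-fetch).
fn read_range(file: &Arc<Vec<u8>>, range: Range<usize>) -> Vec<u8> {
    file[range].to_vec()
}

// Issue all range reads concurrently; the PR does this with
// futures::stream::buffered(8), which also preserves request order.
fn get_ranges(file: &Arc<Vec<u8>>, ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
    thread::scope(|s| {
        let handles: Vec<_> = ranges
            .iter()
            .cloned()
            .map(|r| s.spawn(move || read_range(file, r)))
            .collect();
        // Collect results in the same order the ranges were requested.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let file = Arc::new((0u8..=99).collect::<Vec<u8>>());
    let out = get_ranges(&file, &[0..4, 50..53]);
    assert_eq!(out, vec![vec![0u8, 1, 2, 3], vec![50u8, 51, 52]]);
}
```

With buffering capped (8 in the PR), at most that many reads are in flight at once, bounding memory while still overlapping network latency.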

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. releases-note/feat The PR implements a new feature or has a title that begins with "feat" size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Apr 12, 2026
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Apr 12, 2026
@comphead comphead marked this pull request as draft April 13, 2026 00:40
@comphead
Contributor Author

Putting this into draft, as adding more parallelism significantly increases memory usage

@comphead comphead changed the title feat: read HDFS ranges concurrently feat: get_ranges 2x performance degradation Apr 13, 2026
@comphead
Contributor Author

The last revision seems to have restored performance to what it was before, perhaps slightly faster because stat() is no longer called, and memory usage is moderate

@Xuanwo @kszucs PTAL

@comphead comphead marked this pull request as ready for review April 14, 2026 03:07
.map_err(|err| format_object_store_error(err, &location_ref))
}
})
.buffered(10)
Contributor Author

A parallelism of 10 is also used in coalesce_ranges, which was called by the original object_store get_ranges

@morristai
Member

I think the current change improves get_ranges latency by issuing per-range reads concurrently, but it still performs one backend read per requested range.

reader.read(range) still goes through OpenDAL's normal read path, so for HDFS this still means a fresh file open / read pipeline per range. In other words, this changes N sequential reads into up to N concurrent reads, but it doesn't coalesce nearby ranges or reduce the total number of HDFS opens / NameNode RPCs.

That seems different from the PR description, which suggests this should turn something like 50 range reads into ~5 merged reads. The earlier regression looks more like we lost object_store's default get_ranges coalescing when #7192 added the custom get_range / get_ranges path to avoid stat().

So this patch may still help wall-clock time, but I don't think it fixes the root cause if the goal is to actually reduce HDFS opens. That likely needs a merged-range path again, e.g. something based on OpenDAL's range-fetch/merge logic instead of buffering individual reader.read(range) calls.
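The gap-based merging this comment refers to can be sketched as a pure function. This is only an illustration in the spirit of `object_store::coalesce_ranges`; the function name, signature, and the 1 MB threshold are assumptions drawn from this discussion, not the upstream implementation.

```rust
use std::ops::Range;

// Merge byte ranges whose gap is at most `max_gap`, trading a few larger
// fetches for fewer opens / round trips (the core idea behind
// object_store's default get_ranges coalescing).
fn coalesce(ranges: &[Range<u64>], max_gap: u64) -> Vec<Range<u64>> {
    let mut sorted: Vec<Range<u64>> = ranges.to_vec();
    sorted.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in sorted {
        match merged.last_mut() {
            // Close enough to the previous fetch: extend it instead of
            // issuing a new read.
            Some(prev) if r.start <= prev.end + max_gap => {
                prev.end = prev.end.max(r.end);
            }
            _ => merged.push(r),
        }
    }
    merged
}

fn main() {
    const MB: u64 = 1 << 20;
    // Two nearby rowgroup reads merge; a distant one stays separate.
    let out = coalesce(&[0..4 * MB, 4 * MB + 100..8 * MB, 100 * MB..104 * MB], MB);
    assert_eq!(out, vec![0..8 * MB, 100 * MB..104 * MB]);
}
```

The trade-off debated in this thread follows directly: merging cuts NameNode RPCs and file opens, but each merged fetch materializes the gap bytes too.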

@morristai morristai self-requested a review April 14, 2026 04:24
results.push(data);
}
Ok(results)
let location_ref: Arc<str> = Arc::from(location.as_ref());
Member
Maybe we just call fetch + concurrent here?

@comphead
Contributor Author

Thanks @morristai and @Xuanwo, let me try to bring back coalesce_ranges as it was in the original object_store implementation

@comphead
Contributor Author

Thanks @morristai and @Xuanwo, let me try to bring back coalesce_ranges as it was in the original object_store implementation

coalesce_ranges at least doubles memory usage, however

@comphead
Contributor Author

Rolled back to the version without coalesce_ranges

object_store::coalesce_ranges merges nearby ranges (gap < 1MB) into larger fetches. For HDFS files with multiple rowgroups, this can merge separate rowgroups into a single near-full-file read. The returned Bytes::slice() views retain the full coalesced backing buffer via reference counting, causing ~2x memory usage.

On our cluster with 1200 files × 140MB across 30 machines, this increased total memory from ~0.5 TB to ~1 TB. The simple per-range concurrent approach fetches exactly what's needed with no over-allocation.
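The retention effect described here can be demonstrated with a small std-only sketch. `bytes::Bytes::slice` returns a refcounted view into its backing buffer; the hypothetical `View` type below mimics that behavior with `Arc` so the example needs no external crates.

```rust
use std::sync::Arc;

// A slice view that keeps the whole backing buffer alive via reference
// counting, analogous to how bytes::Bytes::slice behaves.
struct View {
    buf: Arc<Vec<u8>>, // the full coalesced fetch
    start: usize,
    len: usize,
}

impl View {
    fn as_slice(&self) -> &[u8] {
        &self.buf[self.start..self.start + self.len]
    }
    /// Bytes the view logically exposes to the caller.
    fn logical_len(&self) -> usize {
        self.len
    }
    /// Bytes actually retained in memory while this view is alive.
    fn retained_len(&self) -> usize {
        self.buf.len()
    }
}

fn main() {
    // A coalesced read, scaled down: the caller wanted 16 bytes…
    let coalesced = Arc::new(vec![0u8; 1024]);
    let view = View { buf: coalesced, start: 500, len: 16 };
    assert_eq!(view.logical_len(), 16);
    // …but the entire 1024-byte backing buffer stays allocated until the
    // view is dropped, which is the ~2x blow-up seen on the cluster.
    assert_eq!(view.retained_len(), 1024);
    assert_eq!(view.as_slice().len(), 16);
}
```

Copying the slice out (e.g. `Bytes::copy_from_slice`) would release the backing buffer at the cost of an extra memcpy; fetching exact ranges, as this PR does, avoids the problem entirely.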

This PR can act as a patch that fixes the performance while keeping roughly the same memory utilization, and I am happy to participate in a longer-term solution in a follow-up PR if needed

@comphead comphead requested a review from Xuanwo April 14, 2026 21:19

Labels

releases-note/feat The PR implements a new feature or has a title that begins with "feat" size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: HDFS get_ranges performance 2x degradation

3 participants