dd: page-align read buffer #12143
Conversation
Do you know about `aligned-vec`, which might reduce the diff?
```rust
// ------------------------------------------------------------------
// Read
// Resize the buffer to the bsize. Any garbage data in the buffer is
// overwritten or truncated, so there is no need to fill with
// BUF_INIT_BYTE first.
// resizing buf cause serious performance drop
// https://github.com/uutils/coreutils/issues/11544
```
Did you fix the linked issue too?
The link was removed because the buf.resize(...) call it referred to is gone. We now allocate the page-aligned buffer once at the start of dd_copy and slice it per iteration, so the per-iteration resize cost the comment was warning about no longer exists.
That said, this PR does not completely fix #11544: the one-time allocation still zero-fills the buffer up front, so bs=1G count=1 still pays the same ~770 ms it always did.
A real fix needs lazy/uninitialized allocation (e.g. `Vec::<MaybeUninit<_>>::with_capacity` plus `assume_init` of the read prefix), but that would be a follow-up fix.
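The lazy-allocation idea could be sketched like this (a minimal illustration, not code from the PR; `read_into_uninit` and the closure-based "read" are hypothetical stand-ins): allocate capacity without zero-filling, write into the spare capacity, then mark only the written prefix as initialized.

```rust
use std::mem::MaybeUninit;

// Hypothetical sketch: allocate without zero-filling, let a "read"
// write into the uninitialized spare capacity, then assume-init only
// the prefix that was actually written.
fn read_into_uninit<F>(cap: usize, fill: F) -> Vec<u8>
where
    F: FnOnce(&mut [MaybeUninit<u8>]) -> usize,
{
    // No zero-fill happens here, only a capacity reservation.
    let mut buf: Vec<u8> = Vec::with_capacity(cap);
    // The "read" reports how many bytes it initialized.
    let n = fill(buf.spare_capacity_mut());
    assert!(n <= cap);
    // SAFETY: `fill` initialized exactly the first `n` bytes.
    unsafe { buf.set_len(n) };
    buf
}

fn main() {
    // Stand-in for a real read(2) call: writes 5 bytes and reports it.
    let v = read_into_uninit(1 << 20, |spare| {
        for (i, slot) in spare.iter_mut().take(5).enumerate() {
            slot.write(b'a' + i as u8);
        }
        5
    });
    assert_eq!(v, b"abcde");
    println!("{}", String::from_utf8_lossy(&v));
}
```

This avoids touching the untouched tail of a huge `bs=` buffer, which is where the zero-fill cost comes from.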
There was a problem hiding this comment.
The comment was talking about exactly that first zero-fill.
I did not explore that crate specifically. I briefly looked at aligned_buffer, but
That being said, a genuine advantage of using an external crate for this would be that we probably wouldn't have to introduce any new
I think
The read scratch was a `Vec<u8>`, which the global allocator only aligns to `align_of::<u8>() == 1`. Block devices that expose a strict `dma_alignment` (e.g. `loop`, `virtio_blk`, `nbd`, `zram`, which all default to 511) reject `O_DIRECT` reads whose user buffer is below that alignment with `EINVAL`.

Replace the read scratch with `AlignedBuf`, a 4 KiB-aligned buffer backed by a `Vec<AlignedPage>`, where `AlignedPage` is a `#[repr(align(4096))] [u8; 4096]`. `Vec<T>` aligns to `align_of::<T>()`, so this gets us page-aligned memory through the global allocator without `posix_memalign` and with a single `unsafe` block to expose the storage as `&[u8]`.

`fill_consecutive` and `fill_blocks` now take `&mut [u8]` and report the buffer length (read bytes for the former; reads plus padding for the latter) so the caller can slice without `Vec::truncate`. `read_helper` returns the data slice directly: into `read_buf` for the common path, or into a separate `Vec` scratch for `conv=block` / `conv=unblock`, which can change the byte count and so cannot be done in place.

Closes uutils#12085.
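A minimal sketch of the `AlignedPage`/`AlignedBuf` scheme (the type names match the PR description, but the method names and details here are illustrative, not copied from the patch):

```rust
/// A 4 KiB page whose alignment is forced by `#[repr(align(4096))]`.
/// `Vec<AlignedPage>` then hands back page-aligned storage from the
/// global allocator.
#[repr(align(4096))]
#[derive(Clone, Copy)]
struct AlignedPage([u8; 4096]);

struct AlignedBuf(Vec<AlignedPage>);

impl AlignedBuf {
    /// Allocate at least `len` bytes, rounded up to whole 4 KiB pages.
    fn new(len: usize) -> Self {
        AlignedBuf(vec![AlignedPage([0u8; 4096]); len.div_ceil(4096)])
    }

    /// View the page storage as one contiguous byte slice.
    fn as_mut_slice(&mut self) -> &mut [u8] {
        let len = self.0.len() * 4096;
        // SAFETY: `AlignedPage` is a plain byte array with no padding,
        // so `len` initialized u8s start at the first page.
        unsafe { std::slice::from_raw_parts_mut(self.0.as_mut_ptr().cast::<u8>(), len) }
    }
}

fn main() {
    let mut buf = AlignedBuf::new(10_000);
    let s = buf.as_mut_slice();
    assert_eq!(s.len(), 12288); // rounded up to 3 pages
    assert_eq!(s.as_ptr() as usize % 4096, 0); // page-aligned, O_DIRECT-safe
}
```

The single `unsafe` block is the slice view; allocation and alignment themselves stay entirely in safe code.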
GNU testsuite comparison:
Merging this PR will improve performance by ×75
|
| Mode | Metric | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Memory | dd_copy_64k_blocks | 78.1 KB | 14.1 KB | ×5.5 |
| ⚡ | Memory | dd_copy_4k_blocks | 18.1 KB | 14.1 KB | +28.31% |
| ⚡ | Memory | dd_copy_with_seek | 18.4 KB | 14.4 KB | +27.71% |
| ⚡ | Memory | dd_copy_separate_blocks | 138.9 KB | 122.9 KB | +13.02% |
| ⚡ | Memory | dd_copy_partial | 18.1 KB | 14.1 KB | +28.29% |
| ⚡ | Memory | dd_copy_1m_blocks | 1,037.8 KB | 13.8 KB | ×75 |
| ⚡ | Memory | dd_copy_8k_blocks | 22.1 KB | 14.1 KB | +56.62% |
| ⚡ | Memory | dd_copy_default | 14.8 KB | 14.3 KB | +3.5% |
| ⚡ | Memory | dd_copy_with_skip | 18.4 KB | 14.4 KB | +27.71% |
Comparing chrboe:dd-iflag-direct-aligned-buffer (43a7e72) with main (379fbbb)
Footnotes

1. 46 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.
314b026 to 43a7e72
You need to fix
Hardcoding 4096 assumes a 4 KiB page size, which is not portable. I guess you need to obtain the system's page size at runtime and align accordingly.
I don't think a `repr` alignment with a dynamically determined size is possible. Is the LCM of all targets' page sizes OK?
Right, I referenced that fact in a comment:

```rust
/// Block devices that expose a strict `dma_alignment` (e.g. `loop`,
/// `virtio_blk`, `nbd`, `zram` — all default to 511 on Linux) reject
/// `O_DIRECT` reads whose user buffer is below that alignment, returning
/// `EINVAL`. 4 KiB alignment satisfies any `dma_alignment` mask up to 4095,
/// which covers every Linux block driver in mainline at the time of writing.
///
/// On systems with a page size larger than 4 KiB (PowerPC, some aarch64
/// kernel configurations) GNU dd allocates a page-aligned buffer via
/// `posix_memalign(getpagesize())`. We do not — `#[repr(align)]` requires a
/// literal alignment, so this is a fixed compile-time choice. Such systems
/// are not in the project CI matrix (see `.github/workflows/CICD.yml`); if
/// support for them is added, this should be revisited.
```
I assumed that compatibility with such relatively exotic architectures isn't a top priority. Do you have a specific configuration in mind where this would break? Of course, going the GNU way and fetching the page size at runtime is doable, but it costs us some more `unsafe` blocks.
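For comparison, the GNU-style runtime alternative could be sketched with `std::alloc::Layout` (the type and its methods are hypothetical, not part of the PR; `page_size` is a parameter here, whereas real code would fetch it via `libc::sysconf(libc::_SC_PAGESIZE)`). The extra `unsafe` blocks mentioned above are visible in the sketch:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Hypothetical buffer aligned to a page size known only at runtime,
/// the GNU-dd-style alternative to the compile-time `#[repr(align)]`.
struct RuntimeAlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl RuntimeAlignedBuf {
    fn new(len: usize, page_size: usize) -> Self {
        assert!(page_size.is_power_of_two() && len > 0);
        let size = len.next_multiple_of(page_size);
        let layout = Layout::from_size_align(size, page_size).unwrap();
        // SAFETY: `layout` has a non-zero size.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        RuntimeAlignedBuf { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: `ptr` is valid and zero-initialized for `size` bytes.
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for RuntimeAlignedBuf {
    fn drop(&mut self) {
        // SAFETY: `ptr` was allocated with exactly this layout.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

fn main() {
    // Pretend the kernel reported 16 KiB pages.
    let mut buf = RuntimeAlignedBuf::new(5000, 16384);
    let s = buf.as_mut_slice();
    assert_eq!(s.len(), 16384);
    assert_eq!(s.as_ptr() as usize % 16384, 0);
}
```

That is three `unsafe` blocks plus a manual `Drop`, versus the single `unsafe` of the `Vec<AlignedPage>` approach, which is the trade-off being weighed here.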
Can you add a test? Also needed to fix
I think the focus on "page size alignment" is a red herring here. The kernel's page size isn't the relevant constraint; what matters is alignment to the device's sector size.
In any case, 4096 bytes seems correct.
Drafted with the help of an LLM. I admit I don't fully understand all the edge cases and details `dd` has to deal with, so there may still be changes required. I hope this still serves as an aid to get a discussion started around this issue.

In any case, this fixes the issue I originally reported in #12085.