
dd: page-align read buffer#12143

Open
chrboe wants to merge 1 commit into uutils:main from chrboe:dd-iflag-direct-aligned-buffer

Conversation

@chrboe

@chrboe chrboe commented May 4, 2026

The read scratch was a Vec<u8>, which the global allocator only aligns to align_of::<u8>() == 1. Block devices that expose a strict dma_alignment (e.g. loop, virtio_blk, nbd, zram, which all default to 511) reject O_DIRECT reads whose user buffer is below that alignment with EINVAL.

Replace the read scratch with AlignedBuf, a 4 KiB-aligned buffer backed by a Vec<AlignedPage> where AlignedPage is a #[repr(align(4096))] [u8; 4096]. Vec<T> aligns to align_of::<T>(), so this gets us page-aligned memory through the global allocator without posix_memalign and with a single unsafe block to expose the storage as &[u8].
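The approach described above can be sketched as follows. This is a minimal illustration assuming names from the PR description (`AlignedPage`, `AlignedBuf`); the actual implementation in the PR may differ in detail:

```rust
/// 4 KiB-aligned page. Because `Vec<T>` allocates with `align_of::<T>()`,
/// a `Vec<AlignedPage>` yields 4 KiB-aligned storage from the global
/// allocator, with no `posix_memalign`.
#[repr(align(4096))]
#[derive(Clone, Copy)]
struct AlignedPage([u8; 4096]);

/// Page-aligned byte buffer (sketch; not necessarily the PR's exact code).
struct AlignedBuf {
    pages: Vec<AlignedPage>,
    len: usize, // requested length in bytes; may be < pages.len() * 4096
}

impl AlignedBuf {
    fn new(len: usize) -> Self {
        let n_pages = (len + 4095) / 4096; // round up to whole pages
        Self {
            pages: vec![AlignedPage([0u8; 4096]); n_pages],
            len,
        }
    }

    /// The single unsafe block: reinterpret the page storage as bytes.
    fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: the Vec owns at least `len` contiguous, initialized
        // bytes of plain `u8`; the alignment is stricter than required.
        unsafe {
            std::slice::from_raw_parts_mut(self.pages.as_mut_ptr().cast::<u8>(), self.len)
        }
    }
}

fn main() {
    let mut buf = AlignedBuf::new(512);
    assert_eq!(buf.as_mut_slice().len(), 512);
    // 4096-byte alignment satisfies any dma_alignment mask up to 4095.
    assert_eq!(buf.as_mut_slice().as_ptr() as usize % 4096, 0);
}
```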

fill_consecutive and fill_blocks now take &mut [u8] and report the buffer length (read bytes for the former; reads + padding for the latter) so the caller can slice without Vec::truncate. read_helper returns the data slice directly: into read_buf for the common path, or into a separate Vec scratch for conv=block / conv=unblock, which can change the byte count and so cannot be done in place.
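A simplified sketch of what a `fill_consecutive` with the new `&mut [u8]` signature might look like (the real dd helper also tracks block/record counts and handles `count=`; this assumes only the shape described above):

```rust
use std::io::Read;

/// Fill `buf` with consecutive reads and return the number of bytes read,
/// so the caller can slice `&buf[..n]` instead of calling `Vec::truncate`.
/// (Sketch only; signature assumed from the PR description.)
fn fill_consecutive(src: &mut impl Read, buf: &mut [u8]) -> std::io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match src.read(&mut buf[filled..])? {
            0 => break, // EOF
            n => filled += n,
        }
    }
    Ok(filled)
}

fn main() -> std::io::Result<()> {
    let mut src: &[u8] = b"hello world";
    let mut buf = [0u8; 8];
    let n = fill_consecutive(&mut src, &mut buf)?;
    assert_eq!(&buf[..n], b"hello wo");
    Ok(())
}
```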

Closes #12085.


Drafted with the help of an LLM. I admit I don't fully understand all the edge cases and details dd has to deal with, so there may still be changes required.

I hope this still serves as an aid to get discussion started around this issue.

In any case, this fixes the issue I originally reported in #12085.

@oech3
Contributor

oech3 commented May 4, 2026

Do you know about the aligned-vec crate? Using it might reduce the diff.

Comment thread src/uu/dd/src/dd.rs
// ------------------------------------------------------------------
// Read
// Resize the buffer to the bsize. Any garbage data in the buffer is overwritten or truncated, so there is no need to fill with BUF_INIT_BYTE first.
// resizing buf cause serious performance drop https://github.com/uutils/coreutils/issues/11544
Contributor

Did you fix the linked issue too?

Author

The link was removed because the buf.resize(...) call it referred to is gone. We now allocate the page-aligned buffer once at the start of dd_copy and slice it per iteration, so the per-iteration resize cost the comment was warning about no longer exists.

That said, this PR does not completely fix #11544: the one-time allocation still zero-fills the buffer up front, so bs=1G count=1 still pays the same ~770 ms it always did.

A real fix needs lazy/uninitialized allocation (e.g. Vec::<MaybeUninit<_>>::with_capacity + assume_init of the read prefix), but that would be a follow-up fix.
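The lazy-allocation idea could be sketched like this (hypothetical follow-up helper, not part of this PR; it demonstrates only the no-zero-fill allocation, without the alignment wrapper):

```rust
use std::mem::MaybeUninit;

/// Reserve `len` bytes without zero-filling them. `with_capacity` only
/// reserves the pages; nothing touches them, so a huge `bs=` no longer
/// pays a zeroing pass before the first read. Only the prefix actually
/// filled by read(2) would later be `assume_init`-ed.
fn uninit_buf(len: usize) -> Vec<MaybeUninit<u8>> {
    let mut v: Vec<MaybeUninit<u8>> = Vec::with_capacity(len);
    // SAFETY: MaybeUninit<u8> is valid in any state, including
    // uninitialized, and the capacity was reserved just above.
    unsafe { v.set_len(len) };
    v
}

fn main() {
    let buf = uninit_buf(1 << 20);
    assert_eq!(buf.len(), 1 << 20);
}
```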

Contributor

The comment was talking precisely about that first zero-fill.

@chrboe
Author

chrboe commented May 4, 2026

Do you know about the aligned-vec crate? Using it might reduce the diff.

I did not explore that crate specifically. I briefly looked at aligned_buffer, but

  1. I assumed adding a dependency for this edge case would be discouraged by the project, and
  2. it wouldn't reduce the diff by much: we would save the AlignedPage and AlignedBuf definitions, but the other churn would stay.

That being said, a genuine advantage of using an external crate for this would be that we probably wouldn't have to introduce any new unsafe blocks.

@oech3
Contributor

oech3 commented May 4, 2026

I think wc has similar code for aligned allocation (for its SIMD scan?).
If other utils need such code too, it might be worth adding the dependency.

@github-actions

github-actions Bot commented May 4, 2026

GNU testsuite comparison:

GNU test failed: tests/misc/io-errors. tests/misc/io-errors is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/tail/retry is no longer failing!
Congrats! The gnu test tests/expand/bounded-memory is now passing!

@codspeed-hq

codspeed-hq Bot commented May 4, 2026

Merging this PR will improve performance by ×75

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 9 improved benchmarks
✅ 302 untouched benchmarks
⏩ 46 skipped benchmarks1

Performance Changes

| Mode   | Benchmark               | BASE       | HEAD     | Efficiency |
|--------|-------------------------|------------|----------|------------|
| Memory | dd_copy_64k_blocks      | 78.1 KB    | 14.1 KB  | ×5.5       |
| Memory | dd_copy_4k_blocks       | 18.1 KB    | 14.1 KB  | +28.31%    |
| Memory | dd_copy_with_seek       | 18.4 KB    | 14.4 KB  | +27.71%    |
| Memory | dd_copy_separate_blocks | 138.9 KB   | 122.9 KB | +13.02%    |
| Memory | dd_copy_partial         | 18.1 KB    | 14.1 KB  | +28.29%    |
| Memory | dd_copy_1m_blocks       | 1,037.8 KB | 13.8 KB  | ×75        |
| Memory | dd_copy_8k_blocks       | 22.1 KB    | 14.1 KB  | +56.62%    |
| Memory | dd_copy_default         | 14.8 KB    | 14.3 KB  | +3.5%      |
| Memory | dd_copy_with_skip       | 18.4 KB    | 14.4 KB  | +27.71%    |

Comparing chrboe:dd-iflag-direct-aligned-buffer (43a7e72) with main (379fbbb)

Open in CodSpeed

Footnotes

  1. 46 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@chrboe chrboe force-pushed the dd-iflag-direct-aligned-buffer branch from 314b026 to 43a7e72 on May 4, 2026, 13:21
@oech3
Contributor

oech3 commented May 4, 2026

You need to fix spell-checker: ignore https://github.com/uutils/coreutils/actions/runs/25321443056/job/74231496060?pr=12143#step:5:22 .
(By the way, you saved a lot of RAM. Nice!)

@xtqqczze
Contributor

xtqqczze commented May 4, 2026

Hardcoding 4096 assumes a 4 KiB page size, which is not portable. I guess you need to obtain the system’s page size at runtime and align accordingly.

@oech3
Contributor

oech3 commented May 4, 2026

I don't think `repr(align)` can take a dynamically determined size. Would using the LCM of all targets' page sizes be OK?

@chrboe
Author

chrboe commented May 4, 2026

Hardcoding 4096 assumes a 4 KiB page size, which is not portable. I guess you need to obtain the system’s page size at runtime and align accordingly.

Right, I referenced that fact in a comment:

/// Block devices that expose a strict `dma_alignment` (e.g. `loop`,
/// `virtio_blk`, `nbd`, `zram` — all default to 511 on Linux) reject
/// `O_DIRECT` reads whose user buffer is below that alignment, returning
/// `EINVAL`. 4 KiB alignment satisfies any `dma_alignment` mask up to 4095,
/// which covers every Linux block driver in mainline at the time of writing.
///
/// On systems with a page size larger than 4 KiB (PowerPC, some aarch64
/// kernel configurations) GNU dd allocates a page-aligned buffer via
/// `posix_memalign(getpagesize())`. We do not — `#[repr(align)]` requires a
/// literal alignment, so this is a fixed compile-time choice. Such systems
/// are not in the project CI matrix (see `.github/workflows/CICD.yml`); if
/// support for them is added, this should be revisited.

I assumed that compatibility with such relatively exotic architectures isn't a top priority. Do you have a specific configuration in mind where this would break?

Of course, going the GNU way and fetching the page size at runtime is doable, but it costs us some more unsafe blocks.
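For reference, runtime alignment is also reachable through `std::alloc::Layout`, which accepts a non-literal alignment where `#[repr(align)]` cannot. A sketch, assuming `page_size` would come from `sysconf(_SC_PAGESIZE)` (hardcoded here to keep the example dependency-free; this is not the PR's code):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Allocate `len` zeroed bytes at a runtime-chosen alignment.
/// Returns the pointer plus the Layout needed to free it later.
fn alloc_aligned(len: usize, align: usize) -> (*mut u8, Layout) {
    // `align` must be a power of two; Layout checks this for us.
    let layout = Layout::from_size_align(len, align).expect("bad layout");
    // SAFETY: the layout has non-zero size.
    let ptr = unsafe { alloc_zeroed(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    (ptr, layout)
}

fn main() {
    let page_size = 4096; // stand-in for sysconf(_SC_PAGESIZE)
    let (ptr, layout) = alloc_aligned(1 << 16, page_size);
    assert_eq!(ptr as usize % page_size, 0);
    // SAFETY: ptr/layout came from alloc_zeroed above.
    unsafe { dealloc(ptr, layout) };
}
```

The trade-off matches the point above: the `alloc`/`dealloc` pair is itself unsafe, so the unsafe surface grows compared to the `Vec<AlignedPage>` trick.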

@xtqqczze
Contributor

xtqqczze commented May 4, 2026

aarch64-apple-darwin isn't so exotic, but maybe we only need the alignment on Linux?

@oech3
Contributor

oech3 commented May 4, 2026

Can you add a test? The `spell-checker: ignore` fix is also still needed.

@xtqqczze
Contributor

xtqqczze commented May 5, 2026

I think the focus on "page size alignment" is a red herring here. The kernel's page size isn't the relevant constraint; what matters is alignment to the device's sector size.

@oech3
Contributor

oech3 commented May 6, 2026

what matters is alignment to the device’s sector size.

In any case, 4096 bytes seems correct.



Development

Successfully merging this pull request may close these issues.

dd iflag=direct reads fail with "IO error: Invalid input" on devices with strict dma_alignment

3 participants