dd: page-align read buffer #12143
Conversation
Do you know about `aligned-vec`, which might reduce the diff?
```rust
// ------------------------------------------------------------------
// Read
// Resize the buffer to the bsize. Any garbage data in the buffer is
// overwritten or truncated, so there is no need to fill with
// BUF_INIT_BYTE first.
// resizing buf cause serious performance drop
// https://github.com/uutils/coreutils/issues/11544
```
Did you fix the linked issue too?
The link was removed because the buf.resize(...) call it referred to is gone. We now allocate the page-aligned buffer once at the start of dd_copy and slice it per iteration, so the per-iteration resize cost the comment was warning about no longer exists.
That said, this PR does not completely fix #11544: the one-time allocation still zero-fills the buffer up front, so bs=1G count=1 still pays the same ~770 ms it always did.
A real fix needs lazy/uninitialized allocation (e.g. `Vec::<MaybeUninit<_>>::with_capacity` plus `assume_init` of the read prefix), but that would be a follow-up fix.
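The lazy-allocation idea could be sketched like this (a minimal illustration, not code from the PR; `read_into_uninit` and the closure-based "read" are hypothetical stand-ins): allocate capacity without zero-filling, write into the spare capacity, then mark only the written prefix as initialized.

```rust
use std::mem::MaybeUninit;

// Hypothetical sketch: allocate without zero-filling, let a "read"
// write into the uninitialized spare capacity, then assume-init only
// the prefix that was actually written.
fn read_into_uninit<F>(cap: usize, fill: F) -> Vec<u8>
where
    F: FnOnce(&mut [MaybeUninit<u8>]) -> usize,
{
    // No zero-fill happens here, only a capacity reservation.
    let mut buf: Vec<u8> = Vec::with_capacity(cap);
    // The "read" reports how many bytes it initialized.
    let n = fill(buf.spare_capacity_mut());
    assert!(n <= cap);
    // SAFETY: `fill` initialized exactly the first `n` bytes.
    unsafe { buf.set_len(n) };
    buf
}

fn main() {
    // Stand-in for a real read(2) call: writes 5 bytes and reports it.
    let v = read_into_uninit(1 << 20, |spare| {
        for (i, slot) in spare.iter_mut().take(5).enumerate() {
            slot.write(b'a' + i as u8);
        }
        5
    });
    assert_eq!(v, b"abcde");
    println!("{}", String::from_utf8_lossy(&v));
}
```

This avoids touching the untouched tail of a huge `bs=` buffer, which is where the zero-fill cost comes from.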
There was a problem hiding this comment.
The comment was talking about exactly that first zero-fill.
I did not explore that crate specifically. I briefly looked at aligned_buffer, but
That being said, a genuine advantage of using an external crate for this would be that we probably wouldn't have to introduce any new
I think
The read scratch was a `Vec<u8>`, which the global allocator only aligns to `align_of::<u8>() == 1`. Block devices that expose a strict `dma_alignment` (e.g. `loop`, `virtio_blk`, `nbd`, `zram`, which all default to 511) reject `O_DIRECT` reads whose user buffer is below that alignment with `EINVAL`.

Replace the read scratch with `AlignedBuf`, a 4 KiB-aligned buffer backed by a `Vec<AlignedPage>`, where `AlignedPage` is a `#[repr(align(4096))] [u8; 4096]`. `Vec<T>` aligns to `align_of::<T>()`, so this gets us page-aligned memory through the global allocator without `posix_memalign` and with a single `unsafe` block to expose the storage as `&[u8]`.

`fill_consecutive` and `fill_blocks` now take `&mut [u8]` and report the buffer length (read bytes for the former; reads plus padding for the latter) so the caller can slice without `Vec::truncate`. `read_helper` returns the data slice directly: into `read_buf` for the common path, or into a separate `Vec` scratch for `conv=block` / `conv=unblock`, which can change the byte count and so cannot be done in place.

Closes uutils#12085.
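A minimal sketch of the `AlignedPage`/`AlignedBuf` scheme (the type names match the PR description, but the method names and details here are illustrative, not copied from the patch):

```rust
/// A 4 KiB page whose alignment is forced by `#[repr(align(4096))]`.
/// `Vec<AlignedPage>` then hands back page-aligned storage from the
/// global allocator.
#[repr(align(4096))]
#[derive(Clone, Copy)]
struct AlignedPage([u8; 4096]);

struct AlignedBuf(Vec<AlignedPage>);

impl AlignedBuf {
    /// Allocate at least `len` bytes, rounded up to whole 4 KiB pages.
    fn new(len: usize) -> Self {
        AlignedBuf(vec![AlignedPage([0u8; 4096]); len.div_ceil(4096)])
    }

    /// View the page storage as one contiguous byte slice.
    fn as_mut_slice(&mut self) -> &mut [u8] {
        let len = self.0.len() * 4096;
        // SAFETY: `AlignedPage` is a plain byte array with no padding,
        // so `len` initialized u8s start at the first page.
        unsafe { std::slice::from_raw_parts_mut(self.0.as_mut_ptr().cast::<u8>(), len) }
    }
}

fn main() {
    let mut buf = AlignedBuf::new(10_000);
    let s = buf.as_mut_slice();
    assert_eq!(s.len(), 12288); // rounded up to 3 pages
    assert_eq!(s.as_ptr() as usize % 4096, 0); // page-aligned, O_DIRECT-safe
}
```

The single `unsafe` block is the slice view; allocation and alignment themselves stay entirely in safe code.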
GNU testsuite comparison:
Merging this PR will improve performance by ×75
|
| Mode | Metric | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Memory | dd_copy_64k_blocks | 78.1 KB | 14.1 KB | ×5.5 |
| ⚡ | Memory | dd_copy_4k_blocks | 18.1 KB | 14.1 KB | +28.31% |
| ⚡ | Memory | dd_copy_with_seek | 18.4 KB | 14.4 KB | +27.71% |
| ⚡ | Memory | dd_copy_separate_blocks | 138.9 KB | 122.9 KB | +13.02% |
| ⚡ | Memory | dd_copy_partial | 18.1 KB | 14.1 KB | +28.29% |
| ⚡ | Memory | dd_copy_1m_blocks | 1,037.8 KB | 13.8 KB | ×75 |
| ⚡ | Memory | dd_copy_8k_blocks | 22.1 KB | 14.1 KB | +56.62% |
| ⚡ | Memory | dd_copy_default | 14.8 KB | 14.3 KB | +3.5% |
| ⚡ | Memory | dd_copy_with_skip | 18.4 KB | 14.4 KB | +27.71% |
Comparing chrboe:dd-iflag-direct-aligned-buffer (43a7e72) with main (379fbbb)
Footnotes

1. 46 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.
314b026 to 43a7e72
You need to fix
Hardcoding 4096 assumes a 4 KiB page size, which is not portable. I guess you need to obtain the system's page size at runtime and align accordingly.
I don't think a `repr` alignment with a dynamically determined size is possible. Is the LCM of all targets' page sizes OK?
Right, I referenced that fact in a comment:

```rust
/// Block devices that expose a strict `dma_alignment` (e.g. `loop`,
/// `virtio_blk`, `nbd`, `zram` — all default to 511 on Linux) reject
/// `O_DIRECT` reads whose user buffer is below that alignment, returning
/// `EINVAL`. 4 KiB alignment satisfies any `dma_alignment` mask up to 4095,
/// which covers every Linux block driver in mainline at the time of writing.
///
/// On systems with a page size larger than 4 KiB (PowerPC, some aarch64
/// kernel configurations) GNU dd allocates a page-aligned buffer via
/// `posix_memalign(getpagesize())`. We do not — `#[repr(align)]` requires a
/// literal alignment, so this is a fixed compile-time choice. Such systems
/// are not in the project CI matrix (see `.github/workflows/CICD.yml`); if
/// support for them is added, this should be revisited.
```
I assumed that compatibility with such relatively exotic architectures isn't a top priority. Do you have a specific configuration in mind where this would break? Of course, going the GNU way and fetching the page size at runtime is doable, but it costs us some more `unsafe` blocks.
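For comparison, the GNU-style runtime alternative could be sketched with `std::alloc::Layout` (the type and its methods are hypothetical, not part of the PR; `page_size` is a parameter here, whereas real code would fetch it via `libc::sysconf(libc::_SC_PAGESIZE)`). The extra `unsafe` blocks mentioned above are visible in the sketch:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Hypothetical buffer aligned to a page size known only at runtime,
/// the GNU-dd-style alternative to the compile-time `#[repr(align)]`.
struct RuntimeAlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl RuntimeAlignedBuf {
    fn new(len: usize, page_size: usize) -> Self {
        assert!(page_size.is_power_of_two() && len > 0);
        let size = len.next_multiple_of(page_size);
        let layout = Layout::from_size_align(size, page_size).unwrap();
        // SAFETY: `layout` has a non-zero size.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        RuntimeAlignedBuf { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: `ptr` is valid and zero-initialized for `size` bytes.
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for RuntimeAlignedBuf {
    fn drop(&mut self) {
        // SAFETY: `ptr` was allocated with exactly this layout.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

fn main() {
    // Pretend the kernel reported 16 KiB pages.
    let mut buf = RuntimeAlignedBuf::new(5000, 16384);
    let s = buf.as_mut_slice();
    assert_eq!(s.len(), 16384);
    assert_eq!(s.as_ptr() as usize % 16384, 0);
}
```

That is three `unsafe` blocks plus a manual `Drop`, versus the single `unsafe` of the `Vec<AlignedPage>` approach, which is the trade-off being weighed here.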
Can you add a test? Also needed to fix
I think the focus on "page size alignment" is a red herring here. The kernel's page size isn't the relevant constraint; what matters is alignment to the device's sector size.
In any case, 4096 bytes seems correct.
Drafted with the help of an LLM. I admit I don't fully understand all the edge cases and details `dd` has to deal with, so there may still be changes required. I hope this still serves as an aid to get a discussion started around this issue.

In any case, this fixes the issue I originally reported in #12085.