Skip to content

tee: improve throughput by raw syscall#12131

Open
oech3 wants to merge 1 commit intouutils:mainfrom
oech3:tee-throughput
Open

tee: improve throughput by raw syscall#12131
oech3 wants to merge 1 commit intouutils:mainfrom
oech3:tee-throughput

Conversation

@oech3
Copy link
Copy Markdown
Contributor

@oech3 oech3 commented May 3, 2026

Remove flush overhead and use raw syscall for POSIX tee requirement.

$ truncate -s 1PB huge
$ gnu9.11/tee<huge|pv>/dev/null
7.13GiB 0:00:02 [3.61GiB/s]
$ target/release/tee<huge|pv>/dev/null
2.56GiB 0:00:01 [2.56GiB/s]
$ target/release/tee-raw<huge|pv>/dev/null
3.35GiB 0:00:01 [3.35GiB/s]

$ cat huge|gnu9.11/tee|pv>/dev/null
2.15GiB 0:00:01 [2.15GiB/s]
$ cat huge|target/release/tee|pv>/dev/null
2.31GiB 0:00:01 [2.31GiB/s]
$ cat huge|target/release/tee-raw|pv>/dev/null
3.00GiB 0:00:01 [3.00GiB/s]

I was really surprised that std::io has such overhead.

@oech3 oech3 marked this pull request as ready for review May 3, 2026 09:07
@sylvestre
Copy link
Copy Markdown
Contributor

is it possible to have a codspeed benchmark to evaluate the win?

@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 3, 2026

I don't know how to connect stdin to CodSpeed.

@sylvestre
Copy link
Copy Markdown
Contributor

i asked on discord

@oech3

This comment was marked as outdated.

@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 3, 2026

We should also support syscall failure on I/O benchmarks (inparticular for splice(2)).
Some benches with splice(2) code path are invalid for other platforms.

Comment thread src/uu/tee/src/tee.rs Outdated
@oech3

This comment was marked as outdated.

@oech3 oech3 force-pushed the tee-throughput branch from 76b5121 to 6dc2d9f Compare May 3, 2026 14:06
@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 3, 2026

It might syscall-time bounded bench. If so, we cannot use CodSpeed.
But I made #12134 anyway.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 3, 2026

GNU testsuite comparison:

GNU test failed: tests/tail/pipe-f2. tests/tail/pipe-f2 is passing on 'main'. Maybe you have to rebase?
Skipping an intermittent issue tests/date/date-locale-hour (passes in this run but fails in the 'main' branch)

Comment thread src/uu/tee/src/tee.rs Outdated
}
}

#[cfg(not(unix))] // todo: investigate how to improve throughput
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think unsafe is needed if we want do same things for for WIndows.

@oech3 oech3 force-pushed the tee-throughput branch 2 times, most recently from 72a2adc to e81feb1 Compare May 5, 2026 11:11
@xtqqczze
Copy link
Copy Markdown
Contributor

xtqqczze commented May 5, 2026

The Write implementation is somewhat unintuitive; it could be removed in favor of a simpler approach with a single method:

fn write(&mut self, buf: &[u8]) -> Result<()>

Additionally, avoiding Write may be preferable if we plan to support a MaybeUninit buffer in the future.

@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 5, 2026

I don't want to manually write loop writing all of buf.

rustix::io::read can generate uninit safely by splitting init part, but rustix::io::write cannot. So uninit is unrelated for Writer.

@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 5, 2026

This impl is because rustix::io::write does not support writing to stdout on Windows. If we can override write by similar way, it is preffered.

@xtqqczze
Copy link
Copy Markdown
Contributor

xtqqczze commented May 5, 2026

I still think its's confusing to implement Write on Unix, but not Windows...

@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 5, 2026

I can add comment instead. We need 2 duplicated write_all definitiols for Stdout anf File if we manually loop.

@xtqqczze
Copy link
Copy Markdown
Contributor

xtqqczze commented May 5, 2026

I was thinking something like this: b0edaae

@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 5, 2026

No. Because write syscall does not guarantee writing everything. Needs loop and take sum of returned values.
(also rustix::io::write_all is not provided...)

@oech3 oech3 force-pushed the tee-throughput branch from e81feb1 to 687a7cb Compare May 5, 2026 15:18
@xtqqczze
Copy link
Copy Markdown
Contributor

xtqqczze commented May 5, 2026

It would solve many problems if we could use std::io::stdio::StdoutRaw. Perhaps this functionality could be implemented as a crate.

@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 5, 2026

Private? Is it need to copy and paste unsafe code to uucore?

@xtqqczze
Copy link
Copy Markdown
Contributor

xtqqczze commented May 5, 2026

The issue we’re working around is that std::io::Stdout uses buffering, which doesn’t match our requirements. Ideally, we should provide our own Stdout implementation that performs unbuffered I/O instead. But that would be a much larger piece of work.

@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 5, 2026

@xtqqczze
Copy link
Copy Markdown
Contributor

xtqqczze commented May 5, 2026

We already have a solution in uucore::io::OwnedFileDescriptorOrHandle, see #10235.

@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 5, 2026

Hmm... it has try_clone. So introducing new overhead?

@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 5, 2026

We cannot use it #12157

@oech3 oech3 force-pushed the tee-throughput branch from 687a7cb to bcca96f Compare May 5, 2026 22:04
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 5, 2026

Merging this PR will improve performance by ×17

⚡ 1 improved benchmark
✅ 312 untouched benchmarks
⏩ 46 skipped benchmarks1

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation tee_stdin_file[10000000] 3,970.9 µs 233 µs ×17

Comparing oech3:tee-throughput (bcca96f) with main (2e04477)

Open in CodSpeed

Footnotes

  1. 46 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 5, 2026

+syscall time is still better

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants