Add HIP 0001 ring buffer proposal for Hyperlight I/O by andreiltd · Pull Request #1112 · hyperlight-dev/hyperlight

andreiltd · 2025-12-12T14:08:09Z

jsturtevant · 2025-12-16T20:58:39Z

+
+### Risks and Mitigations
+
+**Risk**: Malicious guest corrupting the queue **Mitigation**: Do not expose low level queue API to


I don't know that not exposing it to the guest does anything. The guest is fully untrusted and so we need to do the mitigation suggested after this, that we serialize all data as known types and do assertions as suggested.

jsturtevant · 2025-12-16T21:06:17Z

+example `ioeventfd` for kvm). This is especially useful for streaming scenarios where the guest can
+continue processing while the host consumes data asynchronously.
+
+**3. Inline Descriptors**


Is this proposed to be implemented now or as possible improvement in the future

I would like to give it a try in the first draft, but really if we're gonna keep it or not depends on two factors: 1) can we actually fit any meaningful data into inline slot, e.g. can serialized function call with single argument fit? 2) the performance improvement for such calls is substantial. Both are hard to assess without actual implementation.

jsturtevant · 2025-12-16T21:08:33Z

+2. Device writes up to buffer length
+3. Device sets actual written length in descriptor
+4. If `actual_length > buffer_length`, device sets a `TRUNCATED` flag,
+5. Driver can re-submit with larger buffer if needed


I can image this would be possible performance issue? Can we provide a metric for when this occurs and a way to improve initial buffer size?

That's a good point. I would imagine that this only happens when you try to transport a "big" chunk of data for which the base case should be using stream rather then pushing everything at once. That being said the protocol should support full round trip but really the mitigation of this problem should be using streams.

jsturtevant · 2025-12-16T21:10:12Z

+### Snapshotting
+
+Snapshotting requires that the descriptor table has no in-flight guest-to-host requests and any
+attempt to snapshot a sandbox with such pending requests will result in a snapshot failure.


Can we ensure this?

Yes, we can query the queue for number of inflight messages during snapshotting and fail if it reports positive number. Because we have to track the number of free slots in the queue getting number of inflight messages is just length of the queue minus free slots.

How does this look like for the user?
That means that if you have a busy guest, that keeps producing messages, the host side needs to somehow coordinate with it and stop/finish processing before creating a snapshot, right?

jsturtevant

This looks great, thanks for such a detailed write up. Left a few minor comments but otherwise this seems like it would set up the project to work well for the future

Signed-off-by: Tomasz Andrzejak <andreiltd@gmail.com>

Copilot

Pull request overview

This pull request adds HIP 0001, a comprehensive proposal document for implementing a virtio-inspired ring buffer mechanism for Hyperlight I/O. The proposal aims to replace the current stack-based communication model with a more efficient ring buffer approach to reduce VM exits and enable streaming communication patterns.

Changes:

Adds a detailed technical proposal document describing the ring buffer design, API, and implementation plan
Updates typos.toml to allow "readables" and "writables" as valid terms used in the proposal

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.

File	Description
proposals/0001-rng-buf/README.md	Comprehensive HIP document describing virtio-inspired ring buffer design with memory layout, API examples, and implementation plan
typos.toml	Adds exceptions for "readables" and "writables" terms used in the proposal document

danbugs · 2026-01-14T02:05:09Z

Really great work on this HIP, @andreiltd! Loved seeing the diagrams, extensive testing plan, code, and whatnot.

A few thoughts:

1. Side-by-side comparison suggestion

I think the HIP could benefit from a more explicit side-by-side comparison in the "Comparison with current implementation" section. Something like a table showing:

Current: push_shared_output_data()-> outb -> VM Exit -> pop_shared_input_data() (per call)
Proposed: submit_available() × N -> notify() → VM Exit → poll_used() × N (batched)

I feel like this would help readers quickly grasp the key difference in the two approaches.

2. Related work in Nanvix fork

I've done some similar work in the Nanvix hyperlight fork that might be relevant: 76c9e7c

There, I added a credits_value field to HyperlightPEB that the guest can check before performing a VMExit. This way, the guest avoids unnecessary exits when the host signals no resources are available. This is similar to ring buffer notification suppression / checking num_free() before submitting, so I feel like Hyperlight-Nanvix could really benefit from this work!

3. Single-threaded vs multi-threaded phasing

For the multi-threaded case (guest and host on separate threads), the host would need a way to interrupt the running guest when responses are ready--yes? For that, I've also done related work for that in Nanvix's Hyperlight fork: a897cad for KVM.

My hardware interrupts work enables PIC/APIC interrupt injection which could support the ioeventfd-style notifications mentioned in the HIP. Happy to coordinate on this when the time comes :)

The backward compatibility goal (current function call model portable without public API changes) is really cool to see--great for downstream consumers like Nanvix.

Looking forward to seeing this land!

Fix typos Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Tomasz Andrzejak <andreiltd@users.noreply.github.com>

Signed-off-by: Tomasz Andrzejak <andreiltd@gmail.com>

syntactically

I mostly really like this! Left a few minor nits, and a couple of bigger comments clarifying the question of where buffers get allocated.

I didn't review the Rust type design in detail, as I understand that is going to be updated shortly.

syntactically · 2026-01-22T14:08:20Z

+## Design Details
+
+### Packed Virtqueue Overview
+


I don't love the producer/consumer terminology here. I think it added to the confusion before about how we should have buffer allocation responsibility be divided. How about something like "the driver is the side that can allocate buffers accessible by both sides (i.e. the guest), and the device is the side that reads from/writes to these buffers on request (i.e. the host). The host is essentially providing a specialised paravirtualised device to the guest, and so always takes on the device role.

That's fair, especially that both sides devuce and driver involve producing and consuming descriptors and even more true if we want to maintain that only the guest allocates shared buffers. I will change this.

syntactically · 2026-01-22T14:08:20Z

+can act as driver or device depending on the direction of communication.
+
+#### Bidirectional Communication
+


How about something like:

Hyperlight needs to support both device-initiated (for host->guest function calls) and driver-initiated (for guest->host function calls) communication. This is generally accomplished by having the guest create during early initialisation and keep populated a set of available buffers into which the host should next write incoming work for the guest.

I am not fundamentally opposed to having split rx/tx queues, like e.g. virtio-net, but unsure if they actually make sense here, at least in the single-threaded case?

Single queue might be a cleaner solution but I will need to think harder what it means for backpressure. I will do it here.

Just to not add to the confusion: by split we mean 2 separate queues for each direction and it has nothing to do with virtio terms like split vs packed queue.

In the split queue scenario the backpressure is simple because it is independent in each direction. For single queue on the other hand the backpressure is shared because both directions compete for the same descriptor ring capacity. SO for example, If the guest fills all its available slots with requests, and the host needs to send something to process those requests, this will result in deadlock - so we need to have a mechanism to avoid this.

Maintaining 50/50 split is simple but it's going to be very wasteful for the asymmetric traffic. We could maintain a minimum number of slots that are always available for the device and whenever host mark them used, the driver "immediately" resubmits fresh writable buffers. The explicit pool still introduces a bias -- so should the device be allowed to "request more credits" if it needs increase capacity? This becomes quickly quiet complicated.

So I went ahead and tried to implement function calls with virtq while maintaining guest buffer allocation only.

For mvp I want to replace stack-based function calls with virtqueues, so we need to handle bidirectional RPC:

Guest -> Host calls: Guest initiates, Host processes, returns result

Host -> Guest calls: Host initiates, Guest processes, returns result

Let's consider using a separate H2G for Host -> Guest calls with more or less this flow:

Guest prefills H2G queue with writable request buffers

Host takes a buffer, writes FunctionCall request, marks USED

Guest sees USED, processes request, sends response back...

The problem is: how does Guest send the response?

Response on Same H2G Queue

H2G Queue State (after prefill): [A:req0][A:req1][A:req2]...[A:req255] <- All AVAIL

Host takes req0, fills with request, marks USED: [U:req0][A:req1][A:req2]...[A:req255]

Guest sees USED, processes, submits response: [A:resp0][A:req1][A:req2]...[A:req255] <- Response at END

The problem here is that virtq reads AVAIL descriptors in order. To read resp0, host must skip past all remaining prefills (req1..req255).

We could think about only keeping a single prefill, but even with 1 prefill, batching (e.g., 5 requests at once) requires 5 prefills. Any N from 1 to 256 suffers the same "skip ahead" problem.

What about sending the request using H2G prefills but guest will use a G2H queue for sending responses?

Guest prefills with writable buffers [A:request_slot]

Host writes request, marks USED

Guest reads request, processes it

Guest sends response via G2H (not H2G)

Response type has to be indicated in message content. Further, it will require having Request/Response correlation id that is outside the virtq. There is also a possible deadlock scneario, similar to what had been discussed earlier:

Host sends H2G request (request_id=42)

Host blocks waiting for response

Guest processes request

G2H is FULL of pending guest -> host requests

Guest can't send response (G2H blocked)

DEADLOCK: Host waiting for response, not draining G2H

The design is not very clean because H2G queue has mixed purpose: accommodating for 2 kind of responses.

If we absolutely need to maintain the assumption that only guest can allocate buffers for initiated requests we will need 3 queues (possibly 4 for symmetry).

Three queues, each with a single purpose:

G2H: Guest -> Host calls with response in same chain [R:request + W:response]

H2G_REQ: Host -> Guest requests only (guest prefills [W:buffer], host writes request)

H2G_RESP: Guest -> Host responses to H2G calls only [R:response with request_id]

No message type disambiguation needed. No deadlock possible H2G_RESP is never blocked by unrelated traffic. Host polls H2G_RESP for responses without draining other queues.

Do I understand correctly that you experimented with a single Guest side queue and ran into the "skip ahead" problem, tried two queues and ran into a deadlock? Having 3 queues was essentially the way to avoid this?

With the split queues (as written now) we don't run into these issues?

Pretty much yes. The proposal assumes that for host to guest communication, host acts as a driver but that will require to allocate buffers on the host side for the guest. That is logically the cleanest solution. If we want guest allocation only, the cleanest solution will be to add dedicated queue for h2g responses:

G2H: Guest -> Host calls with response in same chain [R:request + W:response]

H2G_REQ: Host -> Guest requests only (guest prefills [W:buffer], host writes request)

H2G_RESP: Guest -> Host responses to H2G calls only [R:response with request_id]

One extra queue.

What is the benefit of having the guest controlled queue vs shared (as proposed now)? Trying to get a grasp of the trade offs

A bit late here, I've missed this comment. I think @syntactically can better explain pros and cons, but how I understand it is that the biggest practical benefit of of guest-controlled allocation is that it keeps the memory model simple - everything lives in guest memory, the host just translates and validates addresses. So there is no need for special regions like we have today: input, output stack and all buffers live in guest memory regardless of direction. That being said, based on the analysis above I think simpler memory model is a tradeoff and the complexity will reappear in H2G communication.

Let's move forward with the HIP, i think the general direction is agreed. We can mark this a "Uknowns/TBD", flush out implementations and update the HIP later.

I propose we bring this up at the community meeting on Wednesday

syntactically · 2026-01-22T14:08:20Z

+
+#### Memory Layout
+
+The queue structure reside in shared memory and are accessible to both host and guest. The layout


Should this be an offset from the mapped region base, or a GPA? I would tend to think GPA, even though in most cases the host will want to ensure that it is in the scratch region and then convert to an offset from there.

Yup, GPA is cleaner.

syntactically · 2026-01-22T14:08:20Z

+
+<img width="1324" height="726" alt="submit" src="https://github.com/user-attachments/assets/a3ee2dea-a55b-4d50-8b96-1702617a21f0" />
+
+**Step 2:** Device processes and writes response


I assume that in this picture there is implicitly some header inside the buffers passed where some kind of call ID allocated by the caller can be used to link up calls/responses?

So there is an id in the descriptor, but that identifies the buffer, not the semantic call. But I don't think there is a need for the payload level id: we can just return the opaque submission token that wraps descriptor id that is used to correlate resp/req.

syntactically · 2026-01-22T14:08:20Z

+descriptor layer or in the flatbuffer schema:
+
+1. Driver allocates buffer of estimated size
+2. Device writes up to buffer length


Is this flag set in the queue flags, or in the header protocol inside of the buffers? The latter I assume needs to keep track of both a request id and an offset from the beginning of the response to that request for this buffer, for this to work?

Good point, I would imagine this has to be protocol level as this is not really about the descriptor state but the mutli-part payload state. As you pointed out it will also require an additional context that cannot be expressed with a single flag.

syntactically · 2026-01-22T14:08:20Z

+
+### Snapshotting
+
+Snapshotting requires that the descriptor table has no in-flight guest-to-host requests and any


I guess there is a slightly more complex issue with snapshotting as well: the state of the buffer pool in the driver will need to be considered/updated, because (a) the physical addresses of the inflight "write the next request for me here" buffers will need to be updated to the new scratch region and (b) something will need to be done (perhaps the same update? but slightly unclear, in case the guest is currently using data in those buffers) with the buffers in the allocator pool that are not currently available in the queue.

syntactically

Sorry, some kind of glitch seems to have resulted in my comments getting posted twice. I can't delete this duplicate review comment, apparently.

syntactically

I have some minor comments on the Rust definitions as well.

syntactically · 2026-01-22T14:26:32Z

+This mechanism ensures that no locks are required for synchronization, only memory barriers combined
+with atomic publishing of flags ensure that the other side will never observe a partial update:
+
+- Driver: Write descriptor fields ? memory barrier ? atomic Release-store flags


What barrier were you thinking we might need here? I would think that the synchronising release store and acquire load should always be enough.

Separately-ish, there is the issue of how to ensure that the host does not have data race UB, even in the face of a guest that maliciously violates the synchronisation protocol. This is all a bit complicated by the fact that the reads/writes on the other side are outside the Rust abstract machine. We have some notes about this in shared_mem.rs, which were written for the future thinking about this kind of concurrent update situation. There was also a recentish discussion in unsafe-code-guidelines about this topic, which came to a similar conclusion to our notes: what we really want is a relaxed atomic-per-byte memcpy to get data in/out of the guest (after the acquire atomic load), but unfortunately this operation does not exist in Rust yet. So, I think that the present approach of having an acquire load and then volatile non-atomic reads/writes is still the "least bad" option, even though it is not exactly correct w.r.t. the memory model---hopefully we can eventually switch to the per-byte atomic memcpy.

I would think that the synchronising release store and acquire load should always be enough.

That was my assumption, yes. I guess the usual "TOCTOU" issue where the guest modifies memory between host reads remains and we should probably acknowledge this in the proposal?

syntactically · 2026-01-22T14:28:47Z

+    pub struct DescFlags: u16 {
+        const NEXT     = 1 << 0;   // Buffer continues via next descriptor
+        const WRITE    = 1 << 1;   // Device write-only (otherwise read-only)
+        const INDIRECT = 1 << 2;   // Buffer contains list of descriptors


The text mentioned not supporting indirect, at least for now---do we want this flag to be allocated even though we aren't supporting it yet?

Can change this to RESERVED maybe?

syntactically · 2026-01-22T14:31:10Z

+/// Event suppression structure for controlling notifications.
+/// Both sides can control when they want to be notified.
+#[repr(C)]
+pub struct EventSuppression {


Where do these end up?

Yeah, typically these live at the end of the descriptor ring or at fixed offsets in the queue region. I should probably add this to layout diag.

syntactically · 2026-01-22T14:32:49Z

+    desc_table: DescTable,         // Descriptor table in shared memory
+    id_free: SmallVec<[u16; 256]>, // Stack of free IDs (allows out-of-order completion)
+    id_num: SmallVec<[u16; 256]>,  // Chain length per ID
+    drv_evt_addr: u64,             // Controls when device notifies about used buffers


What does _addr mean here? Are these meant to be bitpacked versions of the suppression structure above?

drv_evt_addr: u64 and dev_evt_addr: u64 are addresses pointing to the EventSuppression structures in shared memory.

syntactically · 2026-01-22T14:35:41Z

+    /// Length in bytes (for used: bytes written by device)
+    pub len: u32,
+    /// Buffer ID for correlating completions with submissions
+    pub id: u16,


Are these meant to identify an individual buffer, or are they the higher-level semantic protocol thing used to correlate e.g. calls and returns? If the former, do they gain us anything on just using the address as the buffer ID?

Per spec, id is for identifying descriptors/chains for out-of-order completion. I think using addresses as ids is problematic because: (a) addresses may be reused after deallocation, (b) chains span multiple buffers and the descriptors will share id.

syntactically · 2026-01-22T14:36:25Z

+
+impl<M: MemOps> RingProducer<M> {
+    /// Submit a buffer chain to the ring.
+    /// Returns the descriptor ID assigned to this chain.


Why do we need this as well as the _notify variant?

_with_notify variant reference the "other" side EventSuppression configuration and return if it should be notified while the other omits this step. It's used internally for optimization where for batching you want to read the suppression config once and submit without rechecking.

syntactically · 2026-01-22T14:37:44Z

+    /// Submit with notification check based on device's event suppression settings.
+    pub fn submit_available_with_notify(&mut self, chain: &BufferChain) -> Result<SubmitResult, RingError>;
+
+    /// Poll for a used buffer. Returns WouldBlock if no completions available.


We may need to think a little bit about whether there is a way to actually block/return control to the host, and whether that should be integrated into or distinct from the host->guest notification mechanism, which, at least right now, will also exit to the host in the calling thread (but perhaps in the future could be some faster IPI?)

That's a good point. One way of encoding this information would be to make should_notify an enum rather than bool and then we could express the reason why notification should happen.

syntactically · 2026-01-22T14:38:18Z

+
+#### Buffer Chain Builder
+
+Type-state pattern enforces that readable buffers are added before writable buffers, preventing


Why do we care about this "readable before writable" invariant?

This is strictly a virtio spec requirement: A buffer consists of zero or more device-readable physically-contiguous elements followed by zero or more physically-contiguous device-writable elements (each buffer has at least one element).

syntactically · 2026-01-22T14:40:35Z

+    inner: RingProducer<M>,
+    notifier: N,
+    pool: P,
+    inflight: Vec<Option<ProducerInflight>>,


What's this? I don't see the ProducerInflight definition here

This tracks buffer allocations for in-flight requests, so they can be freed when the response arrives.

syntactically · 2026-01-22T14:43:22Z

+
+```rust
+/// Trait for buffer providers.
+pub trait BufferProvider {


In case these eventually end up passed around through a lot of parsing code with indeterminate lifetimes, would it be useful for alloc/dealloc here to be more like get/put and keep a refcount for the buffer and return it to the free pool when it's unreferenced? I have no idea of this is true, just thought it was worth throwing out there.

Good point, I wanted to use Bytes::from_owner specifically for this:

https://docs.rs/bytes/latest/bytes/struct.Bytes.html#method.from_owner

/// The owner of mapped buffer, ensuring its lifetime. #[derive(Debug, Clone)] pub struct BufferOwner<P: BufferProvider, M: MemOps> { pub pool: Arc<P>, pub mem: Arc<M>, pub alloc: Allocation, pub written: usize, } impl<P: BufferProvider, M: MemOps> Drop for BufferOwner<P, M> { fn drop(&mut self) { let _ = self.pool.dealloc(self.alloc); } } impl<P: BufferProvider, M: MemOps> AsRef<[u8]> for BufferOwner<P, M> { fn as_ref(&self) -> &[u8] { // MemOps should give us a pointer to stored allocation todo!() } }

This way you can just use Bytes cheap cloning and it will reuse Bytes reference counting. My only concern is about safety of moving around a pointer to guest memory region.

Signed-off-by: Tomasz Andrzejak <andreiltd@gmail.com>

dblnz · 2026-02-04T10:03:35Z

+### Snapshotting
+
+Snapshotting requires that the descriptor table has no in-flight guest-to-host requests and any
+attempt to snapshot a sandbox with such pending requests will result in a snapshot failure.


How does this look like for the user?
That means that if you have a busy guest, that keeps producing messages, the host side needs to somehow coordinate with it and stop/finish processing before creating a snapshot, right?

andreiltd requested review from danbugs, dblnz, devigned, jprendes, ludfjig, marosset, simongdavies and syntactically as code owners December 12, 2025 14:08

andreiltd added area/documentation Related to documentation kind/enhancement For PRs adding features, improving functionality, docs, tests, etc. labels Dec 12, 2025