Skip to content

Fix L4 checksum for virtio port#551

Merged
rjarry merged 1 commit intoDPDK:mainfrom
david-marchand:hack_l4_csum_virtio
Mar 17, 2026
Merged

Fix L4 checksum for virtio port#551
rjarry merged 1 commit intoDPDK:mainfrom
david-marchand:hack_l4_csum_virtio

Conversation

@david-marchand
Copy link
Copy Markdown
Member

@david-marchand david-marchand commented Mar 13, 2026

In case L4 checksums offload have been negotiated at the qemu level, L4 packets may be received from virtio ports with a L4_NONE status, which means that the checksum is incorrect in the data, yet this packet should be treated as valid (described in the DPDK API).

The problem is that grout does not implement L4 checksum handling and just lets this packet go through the stack, leaving the checksum as invalid.

This patch is just a hack.

A good workaround for now is to disable csum as the virtio level via some libvirt xml, like:

...

Fix L4 checksum for virtio port with offload capability

Technical impact: add a virtio-specific RX path that computes and marks L4 checksums for TCP/UDP packets when the virtio driver negotiated L4 checksum offload but packets arrive with L4 status set to NONE.

Changes

modules/infra/control/gr_port.h

  • Add bool virtio_offloads to the GR_IFACE_INFO payload for GR_IFACE_TYPE_PORT to record whether the port is a virtio device advertising L4 offloads.

modules/infra/control/port.c

  • In iface_port_init, call rte_eth_dev_info_get(port_id, &info) and set port->virtio_offloads = true when info.driver_name == "net_virtio" and info.rx_offload_capa includes RTE_ETH_RX_OFFLOAD_TCP_CKSUM. Error path added for rte_eth_dev_info_get failure.

modules/infra/control/graph.c

  • In worker_graph_new selection logic, prefer virtio-specific RX callbacks when port->virtio_offloads is set:
    • Non-bonded interfaces: select rx_virtio_process before falling back to rx_offload_process or rx_process.
    • Bonded interfaces: select rx_bond_virtio_process before rx_bond_offload_process or rx_bond_process.
      This confines the driver-specific selection to graph initialization.

modules/infra/datapath/gr_rxtx.h

  • Add RXTX_F_VIRTIO to rxtx_flags_t.
  • Export prototypes: rx_virtio_process(...) and rx_bond_virtio_process(...).

modules/infra/datapath/port_rx.c

  • Add trace formatting to include "virtio".
  • Implement static fix_l4_csum(struct rte_mbuf *m):
    • Return unless mbuf OL status indicates RX L4 checksum is NONE.
    • Use rte_net_get_ptype to obtain L2/L3/L4 header lengths and L4 type.
    • Compute raw checksum over L4 header + payload via rte_raw_cksum_mbuf, finalize to ones-complement form, handle UDP zero-checksum special-case per RFC768, write checksum into packet at computed offset, and set mbuf OL flag RX_L4_CKSUM_GOOD.
  • Implement rx_virtio_process(...):
    • Pull RX burst, perform VLAN handling and iface assignment, call fix_l4_csum on each mbuf needing correction, set RXTX_F_VIRTIO in trace flags, and forward to IFACE_INPUT.
  • Implement rx_bond_virtio_process(...):
    • Same behavior as rx_virtio_process but selects per-bond iface and sets RXTX_F_BOND in trace flags.
  • Leave existing rx_offload_process and rx_process unchanged for non-virtio devices.

Behavioral result

  • For virtio devices that advertise L4 checksum offload but present packets with L4_NONE, the new virtio RX paths compute and install the correct L4 checksum into the packet and mark the mbuf as having a good RX L4 checksum so downstream processing treats the packet as valid.

@david-marchand
Copy link
Copy Markdown
Member Author

@rjarry @christophefontaine should we create a github issue for tracking this problem?
At least this patch is not acceptable in its current form.

@rjarry
Copy link
Copy Markdown
Collaborator

rjarry commented Mar 16, 2026

@rjarry @christophefontaine should we create a github issue for tracking this problem?

Yes, so we don't forget it.

At least this patch is not acceptable in its current form.

Even if it needs some polishing, I don't think we can do much better. It will need an ugly wart somewhere, either in the rx or tx paths.

@david-marchand
Copy link
Copy Markdown
Member Author

Would it be possible to have a virtio-rx node? (or maybe some node between port_rx and iface_input)
It would avoid adding checks for other drivers.

@christophefontaine
Copy link
Copy Markdown
Collaborator

christophefontaine commented Mar 16, 2026

Would it be possible to have a virtio-rx node? (or maybe some node between port_rx and iface_input) It would avoid adding checks for other drivers.

In worker_graph_new, we already select the rx_process function based on the iface mode.
I think this would be the proper place to select the virtio_rx_process, WDYT ?

                // select the appropriate node process callback
                if (ctx->iface->mode == GR_IFACE_MODE_BOND) {
                        if (port->rx_offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP)
                                node->process = rx_bond_offload_process;
                        else
                                node->process = rx_bond_process;
                } else {
                        if (port->rx_offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP)
                                node->process = rx_offload_process;
                        else
                                node->process = rx_process;
                }

@rjarry
Copy link
Copy Markdown
Collaborator

rjarry commented Mar 16, 2026

Yes this is a good approach. Zero cost for other drivers.

@david-marchand david-marchand changed the title DO NOT MERGE: hack L4 checksum for virtio port Fix L4 checksum for virtio port Mar 16, 2026
@david-marchand
Copy link
Copy Markdown
Member Author

I revamped the code, and fixed other combinations.

@david-marchand david-marchand marked this pull request as ready for review March 16, 2026 14:57
@david-marchand
Copy link
Copy Markdown
Member Author

Rebased after unrelated failure in evpn smoke test.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 16, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

Adds a virtio_offloads bool to port iface info, detects virtio-capable ports via rte_eth_dev_info_get and driver/offload checks, introduces RXTX_F_VIRTIO, and adds two virtio-specific RX handlers (rx_virtio_process, rx_bond_virtio_process) with graph-node selection preferring virtio paths when the flag is set.

Critical bug: on rte_eth_dev_info_get failure the code returns errno_set(-ret) though ret is not set, causing undefined behavior.

📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can generate a title for your PR based on the changes with custom instructions.

Set the reviews.auto_title_instructions setting to generate a title for your PR based on the changes in the PR with custom instructions.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modules/infra/control/graph.c (1)

163-176: ⚠️ Potential issue | 🔴 Critical

Virtio process selection doesn't verify VLAN strip capability is negotiated.

The virtio_offloads flag is set based on driver name and TCP checksum capability only, but the selection logic at lines 163–176 uses this flag without verifying that VLAN strip offload was actually negotiated. If a virtio device has TCP checksum capability but lacks VLAN strip capability, virtio_offloads is true but port->rx_offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP is false, causing the virtio path to be selected for a device that cannot provide hardware VLAN stripping. The virtio process functions rely on the RTE_MBUF_F_RX_VLAN_STRIPPED flag and set vlan_id = 0 when the flag is absent, resulting in broken VLAN handling.

Either check negotiated offloads in the selection condition:

if (port->virtio_offloads && (port->rx_offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP))

Or require VLAN strip capability when setting virtio_offloads in port.c:605–606.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modules/infra/control/graph.c` around lines 163 - 176, The virtio path is
being selected based solely on port->virtio_offloads which doesn't guarantee
VLAN strip was negotiated; update the selection to require negotiated VLAN strip
by changing the conditions that assign node->process for virtio to check both
port->virtio_offloads and (port->rx_offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP)
(i.e., use if (port->virtio_offloads && (port->rx_offloads &
RTE_ETH_RX_OFFLOAD_VLAN_STRIP)) for both the bonded and non-bonded branches), or
alternatively tighten the code that sets port->virtio_offloads so it is only
true when VLAN strip is also negotiated (modify the port.c logic around
virtio_offloads assignment); ensure the final node->process selections still
choose rx_virtio_process/rx_bond_virtio_process only when VLAN strip capability
is present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modules/infra/datapath/port_rx.c`:
- Around line 154-155: The bounds check and write are unsafe: replace the
single-segment check using rte_pktmbuf_data_len(m) and the direct use of
rte_pktmbuf_mtod_offset() with a check that ensures two bytes are available (use
rte_pktmbuf_pkt_len(m) >= csum_offset + sizeof(uint16_t)) and either restrict to
single-segment mbufs (assert rte_pktmbuf_is_contiguous or check
rte_pktmbuf_data_len(m) >= csum_offset + 2) or implement a helper that writes a
uint16_t across segments; update the write site that currently uses
rte_pktmbuf_mtod_offset(m, uint16_t *, csum_offset) = csum to call that helper
(or perform a safe memcpy into the proper segment(s)) so no out-of-bounds write
occurs and multi-segment mbufs are handled or explicitly documented as
unsupported.

---

Outside diff comments:
In `@modules/infra/control/graph.c`:
- Around line 163-176: The virtio path is being selected based solely on
port->virtio_offloads which doesn't guarantee VLAN strip was negotiated; update
the selection to require negotiated VLAN strip by changing the conditions that
assign node->process for virtio to check both port->virtio_offloads and
(port->rx_offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP) (i.e., use if
(port->virtio_offloads && (port->rx_offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP))
for both the bonded and non-bonded branches), or alternatively tighten the code
that sets port->virtio_offloads so it is only true when VLAN strip is also
negotiated (modify the port.c logic around virtio_offloads assignment); ensure
the final node->process selections still choose
rx_virtio_process/rx_bond_virtio_process only when VLAN strip capability is
present.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5d652098-22ad-4ab6-b7e4-e6f9cf02263e

📥 Commits

Reviewing files that changed from the base of the PR and between 3e075e6 and 5fe7867.

📒 Files selected for processing (5)
  • modules/infra/control/gr_port.h
  • modules/infra/control/graph.c
  • modules/infra/control/port.c
  • modules/infra/datapath/gr_rxtx.h
  • modules/infra/datapath/port_rx.c

Comment thread modules/infra/datapath/port_rx.c Outdated
Comment thread modules/infra/datapath/port_rx.c
Comment thread modules/infra/datapath/port_rx.c Outdated
Comment on lines +122 to +123
uint32_t ptype, hdr_len;
uint16_t csum = 0, csum_offset;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it does not matter for checksum computation, but I am always confused when we store network order stuff into a uint16_t variable. Just for the sake of clarity could we use rte_be16t csum?

Also, can, you reorder definitions to have a reverse xmas tree ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect "make format" to handle formatting by itself..

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not reorder lines since this may break compilation. There is no "reverse xmas tree" option unfortunately :(

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the rte_be16_t topic, the DPDK API takes a uint16_t as input.
rte_be16_t is exactly aligned, so it would probably not break anything.

I just see little point.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if we stick to rte_be16_t to be consistent with TCP/UDP header field definitions.

https://github.com/DPDK/dpdk/blob/v25.11/lib/net/rte_tcp.h#L32
https://github.com/DPDK/dpdk/blob/v25.11/lib/net/rte_udp.h#L28

Comment thread modules/infra/datapath/port_rx.c
Comment thread modules/infra/control/port.c Outdated

port->port_id = port_id;
if (rte_eth_dev_info_get(port_id, &info) < 0)
return errno_set(EINVAL);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this call even fail at this point? If it does, at least preserve the returned error code:

	if ((ret = rte_eth_dev_info_get(port_id, &info)) < 0)
		return errno_set(-ret);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it can't, but this helper is marked with warn_unused_result.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, then let's preserve the return value for clarity.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
modules/infra/datapath/port_rx.c (1)

154-155: ⚠️ Potential issue | 🔴 Critical

Off-by-one error: bounds check insufficient for 2-byte write.

Writing a uint16_t at csum_offset requires valid memory for 2 bytes (csum_offset and csum_offset + 1), but the check only ensures 1 byte is available.

🐛 Proposed fix
-	if (rte_pktmbuf_data_len(m) >= csum_offset + 1)
+	if (rte_pktmbuf_data_len(m) >= csum_offset + sizeof(uint16_t))
 		*rte_pktmbuf_mtod_offset(m, uint16_t *, csum_offset) = csum;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modules/infra/datapath/port_rx.c` around lines 154 - 155, The bounds check
before writing a uint16_t is off-by-one: update the condition that currently
uses "rte_pktmbuf_data_len(m) >= csum_offset + 1" to ensure two bytes are
available (e.g., compare against csum_offset + sizeof(uint16_t) or + 2) so the
write performed via rte_pktmbuf_mtod_offset(m, uint16_t *, csum_offset) = csum
does not overflow; adjust the check in the same block in port_rx.c where the
write occurs to validate data length >= csum_offset + sizeof(uint16_t) (or
equivalent) before performing the 2-byte store.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@modules/infra/datapath/port_rx.c`:
- Around line 154-155: The bounds check before writing a uint16_t is off-by-one:
update the condition that currently uses "rte_pktmbuf_data_len(m) >= csum_offset
+ 1" to ensure two bytes are available (e.g., compare against csum_offset +
sizeof(uint16_t) or + 2) so the write performed via rte_pktmbuf_mtod_offset(m,
uint16_t *, csum_offset) = csum does not overflow; adjust the check in the same
block in port_rx.c where the write occurs to validate data length >= csum_offset
+ sizeof(uint16_t) (or equivalent) before performing the 2-byte store.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9f7972e0-5d4a-4626-bc48-40ec80a5a51d

📥 Commits

Reviewing files that changed from the base of the PR and between 5fe7867 and 5ab23ee.

📒 Files selected for processing (5)
  • modules/infra/control/gr_port.h
  • modules/infra/control/graph.c
  • modules/infra/control/port.c
  • modules/infra/datapath/gr_rxtx.h
  • modules/infra/datapath/port_rx.c

@david-marchand david-marchand force-pushed the hack_l4_csum_virtio branch 3 times, most recently from 25d1509 to bf36db3 Compare March 17, 2026 08:02
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
modules/infra/control/port.c (1)

603-604: ⚠️ Potential issue | 🔴 Critical

Capture the rte_eth_dev_info_get() error code before returning.

At Line 603, the call result is not stored, but Line 604 returns -ret. On failure this can return a stale/incorrect value (including success), masking init errors.

🔧 Suggested fix
-	if (rte_eth_dev_info_get(port_id, &info) < 0)
-		return errno_set(-ret);
+	if ((ret = rte_eth_dev_info_get(port_id, &info)) < 0)
+		return errno_set(-ret);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modules/infra/control/port.c` around lines 603 - 604, Store and use the
actual return value from rte_eth_dev_info_get instead of returning a stale -ret;
assign the function result to a local int (e.g., ret =
rte_eth_dev_info_get(port_id, &info)), check if ret < 0 and then return
errno_set(-ret). Update the code around the call to rte_eth_dev_info_get so the
returned error code is the one propagated (referencing rte_eth_dev_info_get,
errno_set, port_id, and info).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modules/infra/datapath/port_rx.c`:
- Around line 156-160: The code currently sets m->ol_flags to
RTE_MBUF_F_RX_L4_CKSUM_GOOD unconditionally even when the checksum write inside
the if (rte_pktmbuf_data_len(m) >= csum_offset + 2) fails; modify the logic so
that the checksum flags (m->ol_flags &= ~RTE_MBUF_F_RX_L4_CKSUM_MASK;
m->ol_flags |= RTE_MBUF_F_RX_L4_CKSUM_GOOD) are only applied when the checksum
was actually written (i.e., inside the successful branch where
rte_pktmbuf_mtod_offset(m, rte_be16_t *, csum_offset) = csum is executed); in
the else branch clear any RX_L4_CKSUM_GOOD bit or set an appropriate bad/unknown
state instead so packets that skipped the write are not marked GOOD.

---

Duplicate comments:
In `@modules/infra/control/port.c`:
- Around line 603-604: Store and use the actual return value from
rte_eth_dev_info_get instead of returning a stale -ret; assign the function
result to a local int (e.g., ret = rte_eth_dev_info_get(port_id, &info)), check
if ret < 0 and then return errno_set(-ret). Update the code around the call to
rte_eth_dev_info_get so the returned error code is the one propagated
(referencing rte_eth_dev_info_get, errno_set, port_id, and info).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 97c53c91-7ded-499f-8bcb-c7b3ed79325d

📥 Commits

Reviewing files that changed from the base of the PR and between 5ab23ee and 25d1509.

📒 Files selected for processing (5)
  • modules/infra/control/gr_port.h
  • modules/infra/control/graph.c
  • modules/infra/control/port.c
  • modules/infra/datapath/gr_rxtx.h
  • modules/infra/datapath/port_rx.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • modules/infra/control/graph.c

Comment thread modules/infra/datapath/port_rx.c
In case L4 checksums have been negotiated at the qemu level, L4 packets
may be received from virtio ports with a L4_CKSUM_NONE status, which means
that the checksum is incorrect in the data, yet this packet should be
treated as valid (described in the DPDK API).

The problem is that grout does not implement L4 checksum handling and
just lets this packet go through the stack, leaving the checksum as
invalid.

Add dedicated process helpers for net/virtio ports and switch to them if
csum were enabled for this virtio port.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
@grout-bot grout-bot force-pushed the hack_l4_csum_virtio branch from bf36db3 to bd412c0 Compare March 17, 2026 08:12
@rjarry rjarry merged commit 0f7b91c into DPDK:main Mar 17, 2026
6 checks passed
@david-marchand david-marchand deleted the hack_l4_csum_virtio branch March 18, 2026 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants