Skip to content

Qemu bugfixes for numa distance and huge pfnmap alignment for 10.1#13

Closed
ankita-nv wants to merge 4 commits intoNVIDIA:nvidia_stable-10.1from
ankita-nv:nvidia_stable-10.1-ankita-bugfixes-0219
Closed

Qemu bugfixes for numa distance and huge pfnmap alignment for 10.1#13
ankita-nv wants to merge 4 commits intoNVIDIA:nvidia_stable-10.1from
ankita-nv:nvidia_stable-10.1-ankita-bugfixes-0219

Conversation

@ankita-nv
Copy link
Copy Markdown
Contributor

@ankita-nv ankita-nv commented Feb 19, 2026

This PR addresses the following bugs:

  1. [PATCH v4 0/3] hw/vfio: Enable hugepfnmap for non-power-of-2 device memory regions
    • Backported from the latest posting that is pulled into qemu.
  2. Patch to preserve generic initiator insertion order
    • Currently under review in Qemu ML (Link currently not present)

@ankita-nv ankita-nv changed the title Nvidia stable 10.1 ankita bugfixes 0219 Qemu bugfixes for numa distance and huge pfnmap alignment Feb 19, 2026
@ankita-nv ankita-nv changed the title Qemu bugfixes for numa distance and huge pfnmap alignment Qemu bugfixes for numa distance and huge pfnmap alignment for 10.1 Feb 19, 2026
Sort sparse mmap regions by offset during region setup to ensure
predictable mapping order, avoid overlaps and a proper handling
of the gaps between sub-regions.

Add validation to detect overlapping sparse regions early during
setup before any mapping operations begin.

The sorting is performed on the subregions ranges during
vfio_setup_region_sparse_mmaps(). This also ensures that subsequent
mapping code can rely on subregions being in ascending offset order.

This is preparatory work for alignment adjustments needed to support
hugepfnmap on systems where device memory (e.g., Grace-based systems)
may have non-power-of-2 sizes.

cc: Alex Williamson <alex@shazbot.org>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260217153010.408739-2-ankita@nvidia.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
(cherry picked from commit da02b21cc70ef04a9ad15198f33551f17c94dff5)
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
@ankita-nv ankita-nv force-pushed the nvidia_stable-10.1-ankita-bugfixes-0219 branch from c9334d9 to 87c192b Compare February 22, 2026 05:27
@nvmochs nvmochs self-requested a review February 23, 2026 17:08
@nvmochs
Copy link
Copy Markdown
Collaborator

nvmochs commented Feb 23, 2026

@ankita-nv

87c192b NVIDIA: SAUCE: hw/acpi/pci.c: preserve generic initiator insertion order

Seeing some differences in acpi_generic_initiator_finalize() vs. what’s in RFCv2.

Should this be based on RFCv2? Or even v3 if one will be posted soon? (looks like there was some feedback on v2)


06c2bcd vfio: Add Error ** parameter to vfio_region_setup()

In the code that was fixed up as part of the backport (the vfio_region_setup in platform.c), is the “error_setg_errno()” still needed when vfio_region_setup() fails?

Add an Error **errp parameter to vfio_region_setup() and
vfio_setup_region_sparse_mmaps to allow proper error handling
instead of just returning error codes.

The function sets errors via error_setg() when failure occur.

Suggested-by: Cedric Le Goater <clg@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260217153010.408739-3-ankita@nvidia.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
(backported from commit c42010197eb905fe826550bb5f7c236d5534ddb4)
[ankita: added the Error** param to vfio_region_setup in hw/vfio/platform.c]
[ankita: Removed the error_setg_errno after vfio_region_setup in platform.c]
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
On Grace-based systems such as GB200, device memory is exposed as a
BAR but the actual mappable size is not power-of-2 aligned. The
previous algorithm aligned each sparse mmap area based on its
individual size using ctz64() which prevented efficient huge page
usage by the kernel.

Adjust VFIO region mapping alignment to use the next power-of-2 of
the total region size and place the sparse subregions at their
appropriate offset. This provides better opportunities to get huge
alignment allowing the kernel to use larger page sizes for the VMA.

This enables the use of PMD-level huge pages which can significantly
improve memory access performance and reduce TLB pressure for large
device memory regions.

With this change:
- Create a single aligned base mapping for the entire region
- Change Alignment to be based on pow2ceil(region->size), capped at 1GiB
- Unmap gaps between sparse regions
- Use MAP_FIXED to overlay sparse mmap areas at their offsets

Example VMA for device memory of size 0x2F00F00000 on GB200:

Before (misaligned, no hugepfnmap):
ff88ff000000-ffb7fff00000 rw-s 400000000000 00:06 727                    /dev/vfio/devices/vfio1

After (aligned to 1GiB boundary, hugepfnmap enabled):
ff8ac0000000-ffb9c0f00000 rw-s 400000000000 00:06 727                    /dev/vfio/devices/vfio1

Requires sparse regions to be sorted by offset (done in previous
patch) to correctly identify and handle gaps.

cc: Alex Williamson <alex@shazbot.org>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260217153010.408739-4-ankita@nvidia.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
(backported from commit 3863e47828d5bda1776fb7588a2187c7fba1d0c2)
[ankita: resolved minor conflict in vfio_region_mmap to set variables]
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
@ankita-nv ankita-nv force-pushed the nvidia_stable-10.1-ankita-bugfixes-0219 branch from 87c192b to 5b4d9ea Compare February 24, 2026 06:28
@ankita-nv
Copy link
Copy Markdown
Contributor Author

87c192b NVIDIA: SAUCE: hw/acpi/pci.c: preserve generic initiator insertion order
Should this be based on RFCv2? Or even v3 if one will be posted soon? (looks like there was some feedback on v2)

Updated to v2. Discussions are still ongoing. There aren't any change identified yet.

In the code that was fixed up as part of the backport (the vfio_region_setup in platform.c), is the “error_setg_errno()” still needed when vfio_region_setup() fails?

Thanks for pointing that out. Fixed.

@nvmochs
Copy link
Copy Markdown
Collaborator

nvmochs commented Feb 24, 2026

Thanks for updating the branch.


5b4d9ea NVIDIA: SAUCE: hw/acpi/pci.c: preserve generic initiator insertion order

One more nit I noticed when reviewing just now:

(backported from https://lore.kernel.org/all/20260217153010.408739-4-ankita@nvidia.com/)

This link is for a different patch (copy / paste oops from the patch just before it).

@ankita-nv
Copy link
Copy Markdown
Contributor Author

Thanks for catching that. However somehow I am not seeing the posting link on the ML.

@ankita-nv ankita-nv force-pushed the nvidia_stable-10.1-ankita-bugfixes-0219 branch from 5b4d9ea to ac9edc8 Compare February 25, 2026 13:54
@ankita-nv
Copy link
Copy Markdown
Contributor Author

@nvmochs somehow there isn't a link to the v2 posting for the following patch:
"NVIDIA: SAUCE: hw/acpi/pci.c: preserve generic initiator insertion order"

Can we consider pulling this PR without the link (as it is marked NVIDIA: SAUCE:)?

@nvmochs
Copy link
Copy Markdown
Collaborator

nvmochs commented Mar 10, 2026

@nvmochs somehow there isn't a link to the v2 posting for the following patch: "NVIDIA: SAUCE: hw/acpi/pci.c: preserve generic initiator insertion order"

Can we consider pulling this PR without the link (as it is marked NVIDIA: SAUCE:)?

@ankita-nv In that case would it make sense to at least have a link to one of the replies? e.g. https://lore.kernel.org/all/20260223112236.000065aa@huawei.com/

Perhaps something like this:

cc: Shameer Kolothum <skolothumtho@nvidia.com>
Fixes: https://github.com/ankita-nv/qemu-gpu/commit/0a5b5acdf2d8c7302ca48d42e6ef3423e1b956d5 ("hw/acpi: Implement the SRAT GI affinity structure")
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
(backported from https://lore.kernel.org/all/20260223112236.000065aa@huawei.com/)
[ankita: ML links to discussion and not patch as the original ML posting was lost]

During creation of the VM's SRAT table, the generic initiator entries
are added. Currently the order in the entries are not controllable from
the qemu command. This is due to the fact that the code queries the
object tree which may not be in the order objects were inserted.

As a fix the patch maintains a GPtrArray of generic initiator objects
that preserves their insertion order. Objects are automatically added
to the array when initialized and removed when finalized. When building
the SRAT table, objects are processed in the order they were first
inserted.

E.g. for the following qemu command.
...
            -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
            -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
            -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
            -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
            -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
            -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
            -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
            -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \
...

Original PXM in the VM SRAT table:
[1A4h 0420 004h]            Proximity Domain : 00000007
[1C4h 0452 004h]            Proximity Domain : 00000006
[1E4h 0484 004h]            Proximity Domain : 00000005
[204h 0516 004h]            Proximity Domain : 00000004
[224h 0548 004h]            Proximity Domain : 00000003
[244h 0580 004h]            Proximity Domain : 00000009
[264h 0612 004h]            Proximity Domain : 00000002
[284h 0644 004h]            Proximity Domain : 00000008
[2A2h 0674 004h]            Proximity Domain : 00000009

After the patch (preserves insertion order):
[1A4h 0420 004h]            Proximity Domain : 00000002
[1C4h 0452 004h]            Proximity Domain : 00000003
[1E4h 0484 004h]            Proximity Domain : 00000004
[204h 0516 004h]            Proximity Domain : 00000005
[224h 0548 004h]            Proximity Domain : 00000006
[244h 0580 004h]            Proximity Domain : 00000007
[264h 0612 004h]            Proximity Domain : 00000008
[284h 0644 004h]            Proximity Domain : 00000009

cc: Shameer Kolothum <skolothumtho@nvidia.com>
Fixes: 0a5b5ac ("hw/acpi: Implement the SRAT GI affinity structure")
(backported from https://lore.kernel.org/all/20260223112236.000065aa@huawei.com/)
[ankita: ML links to discussion and not patch as the original ML posting was lost]
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
@ankita-nv ankita-nv force-pushed the nvidia_stable-10.1-ankita-bugfixes-0219 branch from ac9edc8 to 3ca9dc4 Compare March 10, 2026 03:02
@ankita-nv
Copy link
Copy Markdown
Contributor Author

@nvmochs somehow there isn't a link to the v2 posting for the following patch: "NVIDIA: SAUCE: hw/acpi/pci.c: preserve generic initiator insertion order"
Can we consider pulling this PR without the link (as it is marked NVIDIA: SAUCE:)?

@ankita-nv In that case would it make sense to at least have a link to one of the replies? e.g. https://lore.kernel.org/all/20260223112236.000065aa@huawei.com/

Perhaps something like this:

cc: Shameer Kolothum <skolothumtho@nvidia.com>
Fixes: https://github.com/ankita-nv/qemu-gpu/commit/0a5b5acdf2d8c7302ca48d42e6ef3423e1b956d5 ("hw/acpi: Implement the SRAT GI affinity structure")
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
(backported from https://lore.kernel.org/all/20260223112236.000065aa@huawei.com/)
[ankita: ML links to discussion and not patch as the original ML posting was lost]

Thanks for the suggestion! Added.

Copy link
Copy Markdown
Collaborator

@nvmochs nvmochs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No further issues.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

Copy link
Copy Markdown

@MitchellAugustin MitchellAugustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, with one minor question inline about a reference to vfio_region_setup in trace-events. (wondering if any corresponding update is needed there, given that there was a signature change).

Comment thread hw/acpi/pci.c
build_acpi_generic_initiator,
table_data);
build_all_acpi_generic_initiators(table_data);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit 3ca9dc4 LGTM

Comment thread hw/vfio/region.c

int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
int index, const char *name)
int index, const char *name, Error **errp)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this signature change also need to be represented in hw/vfio/trace-events? I see an unmodified reference to vfio_region_setup there.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so because the actual trace is not being modified - it's only called in the good path and there isn't a need to trace errp there.

@ankita-nv - Please review and confirm.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes correct.

Comment thread hw/vfio/region.c
region->mmaps[0].offset = 0;
region->mmaps[0].size = region->size;
} else if (ret) {
return ret;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit 3e3d59a looks good to me

Comment thread hw/vfio/region.c
* Unmap the rest of the region not covered by sparse mmap.
*/
if (map_offset < region->size) {
munmap(map_align + map_offset, region->size - map_offset);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Looks consistent with landed upstream patch qemu/qemu@3863e47

Copy link
Copy Markdown

@MitchellAugustin MitchellAugustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks all, LGTM. Approved!

@nvmochs
Copy link
Copy Markdown
Collaborator

nvmochs commented Mar 11, 2026

Merged and will be present in 1:10.1.0+nvidia5-1 (and later). Closing PR.

@nvmochs nvmochs closed this Mar 11, 2026
JiandiAnNVIDIA pushed a commit that referenced this pull request Mar 31, 2026
Commit e27194e087ae ("virtio-gpu-virgl: correct parent for blob memory
region") made the name member of MemoryRegion unset, causing a NULL
pointer dereference[1]:
> Thread 2 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> (gdb) bt
> #0  0x00007ffff56565e2 in __strcmp_evex () at /lib64/libc.so.6
> #1  0x0000555555841bdb in find_fd (head=0x5555572337d0 <cpr_state>,
> name=0x0, id=0) at ../migration/cpr.c:68
> #2  cpr_delete_fd (name=name@entry=0x0, id=id@entry=0) at
> ../migration/cpr.c:77
> #3  0x000055555582290a in qemu_ram_free (block=0x7ff7e93aa7f0) at
> ../system/physmem.c:2615
> #4  0x000055555581ae02 in memory_region_finalize (obj=<optimized out>)
> at ../system/memory.c:1816
> #5  0x0000555555a70ab9 in object_deinit (obj=<optimized out>,
> type=<optimized out>) at ../qom/object.c:715
> #6  object_finalize (data=0x7ff7e936eff0) at ../qom/object.c:729
> #7  object_unref (objptr=0x7ff7e936eff0) at ../qom/object.c:1232
> #8  0x0000555555814fae in memory_region_unref (mr=<optimized out>) at
> ../system/memory.c:1848
> #9  flatview_destroy (view=0x555559ed6c40) at ../system/memory.c:301
> #10 0x0000555555bfc122 in call_rcu_thread (opaque=<optimized out>) at
> ../util/rcu.c:324
> #11 0x0000555555bf17a7 in qemu_thread_start (args=0x555557b99520) at
> ../util/qemu-thread-posix.c:393
> #12 0x00007ffff556f464 in start_thread () at /lib64/libc.so.6
> #13 0x00007ffff55f25ac in __clone3 () at /lib64/libc.so.6

The intention of the aforementioned commit is to prevent a MemoryRegion
from parenting itself while its references is counted indendependently
of the device. To achieve the same goal, add a type of QOM objects that
count references and parent MemoryRegions.

[1] https://lore.kernel.org/qemu-devel/4eb93d7a-1fa9-4b3c-8ad7-a2eb64f025a0@collabora.com/

Cc: qemu-stable@nongnu.org
Fixes: e27194e087ae ("virtio-gpu-virgl: correct parent for blob memory region")
Fixes: be88ad4 ("virtio-gpu-virgl: correct parent for blob memory region") for 10.2.x
Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Tested-by: Joelle van Dyne <j@getutm.app>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20260214-region-v1-1-229f00ae1f38@rsg.ci.i.u-tokyo.ac.jp>
(cherry picked from commit b2a279094c3b86667969cc645f7fb1087e08dd19)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
JiandiAnNVIDIA pushed a commit that referenced this pull request Mar 31, 2026
The test case in the ppe42 functional test triggers a TCG debug
assertion, which causes the test to fail in an --enable-debug
build or when the sanitizers are enabled:

#6  0x00007ffff4a3b517 in __assert_fail
    (assertion=0x5555562e7589 "!temp_readonly(ots)", file=0x5555562e5b23 "../../tcg/tcg.c", line=4928, function=0x5555562e8900 <__PRETTY_FUNCTION__.23> "tcg_reg_alloc_mov") at ./assert/assert.c:105
#7  0x0000555555cc2189 in tcg_reg_alloc_mov (s=0x7fff60000b70, op=0x7fff600126f8) at ../../tcg/tcg.c:4928
#8  0x0000555555cc74e0 in tcg_gen_code (s=0x7fff60000b70, tb=0x7fffa802f540, pc_start=4294446080) at ../../tcg/tcg.c:6667
#9  0x0000555555d02abe in setjmp_gen_code
    (env=0x555556cbe610, tb=0x7fffa802f540, pc=4294446080, host_pc=0x7fffeea00c00, max_insns=0x7fffee9f9d74, ti=0x7fffee9f9d90)
    at ../../accel/tcg/translate-all.c:257
#10 0x0000555555d02d75 in tb_gen_code (cpu=0x555556cba590, s=...) at ../../accel/tcg/translate-all.c:325
#11 0x0000555555cf5922 in cpu_exec_loop (cpu=0x555556cba590, sc=0x7fffee9f9ee0) at ../../accel/tcg/cpu-exec.c:970
#12 0x0000555555cf5aae in cpu_exec_setjmp (cpu=0x555556cba590, sc=0x7fffee9f9ee0) at ../../accel/tcg/cpu-exec.c:1016
#13 0x0000555555cf5b4b in cpu_exec (cpu=0x555556cba590) at ../../accel/tcg/cpu-exec.c:1042
#14 0x0000555555d1e7ab in tcg_cpu_exec (cpu=0x555556cba590) at ../../accel/tcg/tcg-accel-ops.c:82
#15 0x0000555555d1ff97 in rr_cpu_thread_fn (arg=0x555556cba590) at ../../accel/tcg/tcg-accel-ops-rr.c:285
#16 0x00005555561586c9 in qemu_thread_start (args=0x555556ee3c90) at ../../util/qemu-thread-posix.c:393
#17 0x00007ffff4a9caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#18 0x00007ffff4b29c6c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

This can be reproduced "by hand":

 ./build/clang/qemu-system-ppc -display none -vga none \
    -machine ppe42_machine -serial stdio \
    -device loader,file=$HOME/.cache/qemu/download/03c1ac0fb7f6c025102a02776a93b35101dae7c14b75e4eab36a337e39042ea8 \
    -device loader,addr=0xfff80040,cpu-num=0

(assuming you have the image file from the functional test
in your local cache).

This happens for this input:

IN:
0xfff80c00:  07436004  .byte    0x07, 0x43, 0x60, 0x04

which generates (among other things):

 not_i32 $0x80000,$0x80000

which the TCG optimization pass turns into:

 mov_i32 $0x80000,$0xfff7ffff             dead: 1  pref=0xffff

and where we then assert because we tried to write to a constant.

This happens for the CLRBWIBC instruction which ends up in
do_mask_branch() with rb_is_gpr false and invert true.  In this case
we will generate code that sets mask to a tcg_constant_tl() but then
uses it as the LHS in tcg_gen_not_tl().

Fix the assertion by doing the invert in the translate time C code
for the "mask is constant" case.

Cc: qemu-stable@nongnu.org
Fixes: f7ec91c ("target/ppc: Add IBM PPE42 special instructions")
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Glenn Miles <milesg@linux.ibm.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/qemu-devel/20260212150753.1749448-1-peter.maydell@linaro.org
Signed-off-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
(cherry picked from commit 78c6b6010ce7cfa54874dda514e694640b76f1e4)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants