Qemu bugfixes for numa distance and huge pfnmap alignment for 10.2 by ankita-nv · Pull Request #14 · NVIDIA/QEMU

ankita-nv · 2026-02-19T13:45:41Z

This PR addresses the following bugs:

- Correct setting of NUMA distances
  - This is under internal review.
- [PATCH v4 0/3] hw/vfio: Enable hugepfnmap for non-power-of-2 device memory regions
  - Backported from the latest posting, which has been pulled into QEMU.

Sort sparse mmap regions by offset during region setup to ensure a predictable mapping order, avoid overlaps, and properly handle the gaps between sub-regions. Add validation to detect overlapping sparse regions early during setup, before any mapping operations begin.

The sorting is performed on the subregion ranges in vfio_setup_region_sparse_mmaps(). This also ensures that subsequent mapping code can rely on subregions being in ascending offset order.

This is preparatory work for the alignment adjustments needed to support hugepfnmap on systems (e.g., Grace-based systems) where device memory may have non-power-of-2 sizes.

cc: Alex Williamson <alex@shazbot.org>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260217153010.408739-2-ankita@nvidia.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
(cherry picked from commit da02b21cc70ef04a9ad15198f33551f17c94dff5)
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
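The sort-then-validate step described in this commit message can be sketched as follows. This is a minimal Python model, not the QEMU C code: `SparseArea`, `sort_and_validate()` and `gaps_between()` are hypothetical names introduced only for illustration, and the gap helper assumes the areas have already been sorted and validated.

```python
from typing import Iterator, List, NamedTuple, Tuple


class SparseArea(NamedTuple):
    """Hypothetical stand-in for one sparse mmap area of a VFIO region."""
    offset: int  # byte offset of the mmap-able area inside the region
    size: int    # length of the area in bytes


def sort_and_validate(areas: List[SparseArea]) -> List[SparseArea]:
    """Return the areas in ascending offset order, rejecting overlaps."""
    ordered = sorted(areas, key=lambda a: a.offset)
    # After sorting, an overlap can only occur between neighbours.
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.offset + prev.size > cur.offset:
            raise ValueError(
                f"sparse areas overlap: area ending at "
                f"{prev.offset + prev.size:#x} and area starting at "
                f"{cur.offset:#x}"
            )
    return ordered


def gaps_between(ordered: List[SparseArea],
                 region_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, length) for every hole not covered by a sorted area."""
    cursor = 0
    for area in ordered:
        if area.offset > cursor:
            yield (cursor, area.offset - cursor)
        cursor = area.offset + area.size
    if cursor < region_size:
        yield (cursor, region_size - cursor)
```

Because the list is sorted first, a single pass over adjacent pairs is enough to find any overlap, and the same ordering makes the holes between areas easy to enumerate.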

Add an Error **errp parameter to vfio_region_setup() and vfio_setup_region_sparse_mmaps() to allow proper error handling instead of just returning error codes. The functions set errors via error_setg() when a failure occurs.

Suggested-by: Cedric Le Goater <clg@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260217153010.408739-3-ankita@nvidia.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
(cherry picked from commit c42010197eb905fe826550bb5f7c236d5534ddb4)
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

On Grace-based systems such as GB200, device memory is exposed as a BAR, but the actual mappable size is not power-of-2 aligned. The previous algorithm aligned each sparse mmap area based on its individual size using ctz64(), which prevented efficient huge page usage by the kernel.

Adjust VFIO region mapping alignment to use the next power of 2 of the total region size and place the sparse subregions at their appropriate offsets. This provides better opportunities to get huge alignment, allowing the kernel to use larger page sizes for the VMA. This enables the use of PMD-level huge pages, which can significantly improve memory access performance and reduce TLB pressure for large device memory regions.

With this change:
- Create a single aligned base mapping for the entire region
- Change alignment to be based on pow2ceil(region->size), capped at 1GiB
- Unmap gaps between sparse regions
- Use MAP_FIXED to overlay sparse mmap areas at their offsets

Example VMA for device memory of size 0x2F00F00000 on GB200:

Before (misaligned, no hugepfnmap):
    ff88ff000000-ffb7fff00000 rw-s 400000000000 00:06 727 /dev/vfio/devices/vfio1

After (aligned to 1GiB boundary, hugepfnmap enabled):
    ff8ac0000000-ffb9c0f00000 rw-s 400000000000 00:06 727 /dev/vfio/devices/vfio1

Requires sparse regions to be sorted by offset (done in previous patch) to correctly identify and handle gaps.

cc: Alex Williamson <alex@shazbot.org>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260217153010.408739-4-ankita@nvidia.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
(backported from commit 3863e47828d5bda1776fb7588a2187c7fba1d0c2)
[ankita: resolved minor conflict in vfio_region_mmap to set variables]
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
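To make the alignment arithmetic in this commit message concrete, here is a small Python sketch that compares the two policies for the 0x2F00F00000 example. It only models the arithmetic: `old_alignment()` and `new_alignment()` are hypothetical helper names, and reading the previous policy as "1 << ctz64(size)" is an assumption based on the description above, not a copy of the QEMU code.

```python
GIB = 1 << 30


def ctz64(value: int) -> int:
    """Count trailing zero bits of a non-zero value."""
    return (value & -value).bit_length() - 1


def pow2ceil(value: int) -> int:
    """Smallest power of two that is >= value (value must be >= 1)."""
    return 1 << (value - 1).bit_length()


def old_alignment(area_size: int) -> int:
    # Assumed previous policy: align to the lowest set bit of the area size.
    return 1 << ctz64(area_size)


def new_alignment(region_size: int, cap: int = GIB) -> int:
    # Policy described above: next power of two of the whole region,
    # capped at 1 GiB.
    return min(pow2ceil(region_size), cap)


def is_aligned(addr: int, alignment: int) -> bool:
    """True if addr is a multiple of alignment (a power of two)."""
    return addr & (alignment - 1) == 0
```

For a size of 0x2F00F00000 the lowest set bit is bit 20, so the size-based policy only guarantees 1 MiB alignment, while the region-based policy yields the 1 GiB cap. Checking the two VMA start addresses from the example, `is_aligned(0xff8ac0000000, GIB)` holds and `is_aligned(0xff88ff000000, GIB)` does not.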

During creation of the VM's SRAT table, the generic initiator entries are added. Currently, the order of the entries is not controllable from the QEMU command line. This is because the code queries the object tree, which may not be in the order in which the objects were inserted.

As a fix, the patch maintains a GPtrArray of generic initiator objects that preserves their insertion order. Objects are automatically added to the array when initialized and removed when finalized. When building the SRAT table, objects are processed in the order they were first inserted.

E.g., for the following QEMU command:

    ...
    -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
    -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
    -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
    -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
    -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
    -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
    -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
    -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \
    ...

Original PXM in the VM SRAT table:

    [1A4h 0420 004h] Proximity Domain : 00000007
    [1C4h 0452 004h] Proximity Domain : 00000006
    [1E4h 0484 004h] Proximity Domain : 00000005
    [204h 0516 004h] Proximity Domain : 00000004
    [224h 0548 004h] Proximity Domain : 00000003
    [244h 0580 004h] Proximity Domain : 00000009
    [264h 0612 004h] Proximity Domain : 00000002
    [284h 0644 004h] Proximity Domain : 00000008
    [2A2h 0674 004h] Proximity Domain : 00000009

After the patch (preserves insertion order):

    [1A4h 0420 004h] Proximity Domain : 00000002
    [1C4h 0452 004h] Proximity Domain : 00000003
    [1E4h 0484 004h] Proximity Domain : 00000004
    [204h 0516 004h] Proximity Domain : 00000005
    [224h 0548 004h] Proximity Domain : 00000006
    [244h 0580 004h] Proximity Domain : 00000007
    [264h 0612 004h] Proximity Domain : 00000008
    [284h 0644 004h] Proximity Domain : 00000009

cc: Shameer Kolothum <skolothumtho@nvidia.com>
Fixes: 0a5b5ac ("hw/acpi: Implement the SRAT GI affinity structure")
(backported from https://lore.kernel.org/all/20260223112236.000065aa@huawei.com/)
[ankita: ML links to discussion and not patch as the original ML posting was lost]
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
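The insertion-order bookkeeping described in this commit message can be modelled roughly as below. This is a Python sketch under stated assumptions, not the QEMU implementation: `InitiatorRegistry` is a hypothetical stand-in for the GPtrArray, and `add()` / `remove()` stand in for the object init and finalize hooks.

```python
from typing import List, Tuple


class InitiatorRegistry:
    """Hypothetical ordered registry of (id, node) generic initiators."""

    def __init__(self) -> None:
        # A plain list keeps elements in the order they were appended,
        # which is the property the commit message relies on.
        self._items: List[Tuple[str, int]] = []

    def add(self, obj_id: str, node: int) -> None:
        """Called when an initiator object is initialized."""
        self._items.append((obj_id, node))

    def remove(self, obj_id: str) -> None:
        """Called when an initiator object is finalized."""
        self._items = [item for item in self._items if item[0] != obj_id]

    def proximity_domains(self) -> List[int]:
        """Nodes in the order an SRAT builder would emit them."""
        return [node for _, node in self._items]
```

Adding gi0 through gi7 with nodes 2 through 9, as in the command line above, makes `proximity_domains()` return the nodes in ascending order, and removing one object leaves the relative order of the others unchanged.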

ankita-nv · 2026-03-30T01:51:22Z

Can we get this merged as well? This is the QEMU 10.2 version of #13.

nvmochs · 2026-03-30T14:23:19Z

> Can we get this merged as well? This is the QEMU 10.2 version of #13.

Yes, this will be merged into 10.2 once that branch is ready.

The test case in the ppe42 functional test triggers a TCG debug assertion, which causes the test to fail in an --enable-debug build or when the sanitizers are enabled:

    #6 0x00007ffff4a3b517 in __assert_fail (assertion=0x5555562e7589 "!temp_readonly(ots)", file=0x5555562e5b23 "../../tcg/tcg.c", line=4928, function=0x5555562e8900 <__PRETTY_FUNCTION__.23> "tcg_reg_alloc_mov") at ./assert/assert.c:105
    #7 0x0000555555cc2189 in tcg_reg_alloc_mov (s=0x7fff60000b70, op=0x7fff600126f8) at ../../tcg/tcg.c:4928
    #8 0x0000555555cc74e0 in tcg_gen_code (s=0x7fff60000b70, tb=0x7fffa802f540, pc_start=4294446080) at ../../tcg/tcg.c:6667
    #9 0x0000555555d02abe in setjmp_gen_code (env=0x555556cbe610, tb=0x7fffa802f540, pc=4294446080, host_pc=0x7fffeea00c00, max_insns=0x7fffee9f9d74, ti=0x7fffee9f9d90) at ../../accel/tcg/translate-all.c:257
    #10 0x0000555555d02d75 in tb_gen_code (cpu=0x555556cba590, s=...) at ../../accel/tcg/translate-all.c:325
    #11 0x0000555555cf5922 in cpu_exec_loop (cpu=0x555556cba590, sc=0x7fffee9f9ee0) at ../../accel/tcg/cpu-exec.c:970
    #12 0x0000555555cf5aae in cpu_exec_setjmp (cpu=0x555556cba590, sc=0x7fffee9f9ee0) at ../../accel/tcg/cpu-exec.c:1016
    #13 0x0000555555cf5b4b in cpu_exec (cpu=0x555556cba590) at ../../accel/tcg/cpu-exec.c:1042
    #14 0x0000555555d1e7ab in tcg_cpu_exec (cpu=0x555556cba590) at ../../accel/tcg/tcg-accel-ops.c:82
    #15 0x0000555555d1ff97 in rr_cpu_thread_fn (arg=0x555556cba590) at ../../accel/tcg/tcg-accel-ops-rr.c:285
    #16 0x00005555561586c9 in qemu_thread_start (args=0x555556ee3c90) at ../../util/qemu-thread-posix.c:393
    #17 0x00007ffff4a9caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
    #18 0x00007ffff4b29c6c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

This can be reproduced "by hand":

    ./build/clang/qemu-system-ppc -display none -vga none \
      -machine ppe42_machine -serial stdio \
      -device loader,file=$HOME/.cache/qemu/download/03c1ac0fb7f6c025102a02776a93b35101dae7c14b75e4eab36a337e39042ea8 \
      -device loader,addr=0xfff80040,cpu-num=0

(assuming you have the image file from the functional test in your local cache).

This happens for this input:

    IN:
    0xfff80c00: 07436004 .byte 0x07, 0x43, 0x60, 0x04

which generates (among other things):

    not_i32 $0x80000,$0x80000

which the TCG optimization pass turns into:

    mov_i32 $0x80000,$0xfff7ffff dead: 1 pref=0xffff

and where we then assert because we tried to write to a constant.

This happens for the CLRBWIBC instruction which ends up in do_mask_branch() with rb_is_gpr false and invert true. In this case we will generate code that sets mask to a tcg_constant_tl() but then uses it as the LHS in tcg_gen_not_tl().

Fix the assertion by doing the invert in the translate time C code for the "mask is constant" case.

Cc: qemu-stable@nongnu.org
Fixes: f7ec91c ("target/ppc: Add IBM PPE42 special instructions")
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Glenn Miles <milesg@linux.ibm.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/qemu-devel/20260212150753.1749448-1-peter.maydell@linaro.org
Signed-off-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
(cherry picked from commit 78c6b6010ce7cfa54874dda514e694640b76f1e4)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
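The idea behind this fix, folding the inversion at translate time when the mask is a constant instead of emitting an op that writes to a read-only temp, can be illustrated with a toy Python model. Everything here is hypothetical scaffolding (`Const`, `ReadOnlyTempError`, `emit_not()`, the two `build_mask_*()` functions); in particular, the model raises at emission time, whereas the real assertion fires later in tcg_reg_alloc_mov().

```python
MASK32 = 0xFFFFFFFF


class ReadOnlyTempError(Exception):
    """Raised by the toy model when an op targets a constant."""


class Const:
    """Toy read-only constant, loosely modelling tcg_constant_tl()."""

    def __init__(self, value: int) -> None:
        self.value = value & MASK32


def emit_not(dst: object, src: object, ops: list) -> None:
    """Toy model of emitting a NOT op; a constant is not a valid target."""
    if isinstance(dst, Const):
        raise ReadOnlyTempError("destination temp is read-only")
    ops.append(("not_i32", dst, src))


def build_mask_buggy(bit_mask: int, invert: bool, ops: list) -> Const:
    mask = Const(bit_mask)
    if invert:
        emit_not(mask, mask, ops)  # writes to a constant: rejected
    return mask


def build_mask_fixed(bit_mask: int, invert: bool, ops: list) -> Const:
    if invert:
        # Invert in the translator itself; no op is emitted at all.
        bit_mask = ~bit_mask & MASK32
    return Const(bit_mask)
```

With a mask of 0x80000 and invert set, the fixed variant returns a constant holding 0xfff7ffff, which matches the value the optimizer computed in the trace above, and it leaves the op list empty.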

ankita-nv changed the title ~~Qemu bugfixes for numa distance and huge pfnmap alignment~~ Qemu bugfixes for numa distance and huge pfnmap alignment for 10.2 Feb 19, 2026

ankita-nv added 3 commits February 22, 2026 05:36

ankita-nv force-pushed the nvidia_stable-10.2-ankita-bugfixes-0219 branch 2 times, most recently from 67aab4d to 329cf72 Compare February 24, 2026 06:30

ankita-nv force-pushed the nvidia_stable-10.2-ankita-bugfixes-0219 branch from 329cf72 to 9081ff7 Compare March 7, 2026 05:59

ankita-nv force-pushed the nvidia_stable-10.2-ankita-bugfixes-0219 branch from 9081ff7 to 78c52a1 Compare March 10, 2026 03:00

ankita-nv closed this by deleting the head repository Apr 6, 2026

ankita-nv wants to merge 4 commits into NVIDIA:nvidia_stable-10.2 from ankita-nv:nvidia_stable-10.2-ankita-bugfixes-0219
