Vulkan-exported dmabuf can't be imported by mlx5 ibverbs #1130

@bmerry

Description

NVIDIA Open GPU Kernel Modules Version

595.71.05

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 24.04.4 LTS

Kernel Release

Linux saraoamd 6.17.0-20-generic #20~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Mar 19 01:28:37 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA RTX PRO 5000 Blackwell

Describe the bug

I'm trying to use GPUDirect RDMA so that a ConnectX-7 NIC can transmit data directly from GPU memory allocated by Vulkan. To do that, I export a dma-buf fd from Vulkan and import it into ibverbs with ibv_reg_dmabuf_mr. The import fails because the sg_table can't be obtained: it bails out at this line. The Vulkan memory has VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT set, so I'm guessing it must live in BAR1 space and should, in theory, be possible to map.

This sounds like it might be related to #1037, which is also about trying to use memory allocated on an NVIDIA GPU from other drivers.

To Reproduce

  • RDMA-capable GPU, with driver 595.71.05
  • mlx5-based NIC, e.g. ConnectX-7, with DOCA 3.3.0. Mine is in Ethernet mode and configured with an IP address etc., but I doubt that makes a difference.
  • Vulkan libraries installed.
  • Compile this code with gcc -o nvidia-vulkan-dmabuf nvidia-vulkan-dmabuf.c -Wall -g -lvulkan -libverbs
  • Run it.

Output I get (the FD may change from run to run):

Successfully obtained dma-buf: fd = 49
Using IBV device mlx5_0
ibv_reg_dmabuf_mr failed: Cannot allocate memory

Expected output: "Successfully created MR from dmabuf"
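A note on reading that error: "Cannot allocate memory" is glibc's message for ENOMEM. Assuming the test program prints strerror(errno) after the failed call (an assumption, since the reproducer is only linked), the registration is returning ENOMEM. The small sketch below prints the symbolic name next to the message, which makes it easier to match against kernel-side return codes. The helpers errno_name and report_failure are hypothetical and not part of the reproducer.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: symbolic names for a few errno values that are
 * plausible here; anything else is reported as "(other)". */
static const char *errno_name(int err)
{
    switch (err) {
    case ENOMEM:     return "ENOMEM";
    case EINVAL:     return "EINVAL";
    case EBADF:      return "EBADF";
    case EOPNOTSUPP: return "EOPNOTSUPP";
    default:         return "(other)";
    }
}

/* Format "<what> failed: <message> (<NAME>, errno <n>)" into buf. */
static void report_failure(const char *what, int err, char *buf, size_t n)
{
    snprintf(buf, n, "%s failed: %s (%s, errno %d)",
             what, strerror(err), errno_name(err), err);
}
```

Calling report_failure("ibv_reg_dmabuf_mr", errno, buf, sizeof buf) right after the failed registration would then name the errno explicitly rather than relying on the message text.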

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

The nvidia-bug-report output might show that I'm using a modified version of the kernel module; that's just pr_debug statements I added to narrow down the code flow. The problem was first observed with an unmodified version (and on 595.58.03).

I ticked the box to say this only happens on the open driver, but I didn't actually test that, because AFAIK the VK_EXT_external_memory_dma_buf extension is only available on the open driver.

Labels: bug
