
fix: Add sysfs fallback for RDMA detection on InfiniBand interfaces #9

Merged
k8s-ci-robot merged 2 commits into kubernetes-sigs:main from anson627:fix-rdma-detection-infiniband
Jan 19, 2026

@anson627
Contributor

The rdmamap library v1.1.0 has a bug where it compares InfiniBand
interface hardware addresses against the node GUID instead of the
port GUID. This causes InfiniBand (IPoIB) interfaces to incorrectly
report dra.net/rdma=false even when they have RDMA capability.

Related upstream issue: https://github.com/Mellanox/rdmamap/issues/15

Root cause:
- The hardware address of an IPoIB interface carries the port GUID (the low 8 bytes of the port GID)
- The rdmamap library compares that address against the node GUID
- The two GUIDs differ in byte 5 (0xfd vs 0xfe)
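
To illustrate the mismatch, here is a minimal Go sketch. It is not code from this PR or from rdmamap. It assumes only the standard IPoIB link-layer address layout (20 bytes: 4 bytes of flags and QPN followed by a 16-byte GID whose last 8 bytes are the port GUID). The function names and every address value below are made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"net"
)

// portGUIDFromIPoIB returns the last 8 bytes of a 20-byte IPoIB
// link-layer address, which hold the port GUID.
// Hypothetical helper, not part of rdmamap or dranet.
func portGUIDFromIPoIB(hw net.HardwareAddr) ([]byte, error) {
	if len(hw) != 20 {
		return nil, fmt.Errorf("not an IPoIB address: got %d bytes, want 20", len(hw))
	}
	return hw[12:], nil
}

// matchesGUID reports whether the port GUID embedded in ipoibAddr
// equals guid (an 8-byte GUID written as colon-separated hex).
func matchesGUID(ipoibAddr, guid string) (bool, error) {
	hw, err := net.ParseMAC(ipoibAddr) // ParseMAC accepts 20-octet IPoIB addresses
	if err != nil {
		return false, err
	}
	want, err := net.ParseMAC(guid) // and 8-octet EUI-64 values
	if err != nil {
		return false, err
	}
	got, err := portGUIDFromIPoIB(hw)
	if err != nil {
		return false, err
	}
	return bytes.Equal(got, want), nil
}

func main() {
	// All values are invented; only the layout matters.
	const ipoibAddr = "00:00:10:29:fe:80:00:00:00:00:00:00:00:11:22:33:44:55:66:77"
	guids := []string{
		"00:11:22:33:44:55:66:77", // stands in for the port GUID
		"00:11:22:33:44:55:66:76", // stands in for a node GUID that differs in one byte
	}
	for _, guid := range guids {
		ok, err := matchesGUID(ipoibAddr, guid)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("GUID %s matches: %v\n", guid, ok)
	}
}
```

A comparison against the node GUID fails as soon as the two GUIDs differ in any byte, which is why the interface ends up reported as non-RDMA.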

Changes:
- Add hasRDMADeviceInSysfs() helper that checks sysfs directly
- Update discoverRDMADevices() to fall back to sysfs check when
  rdmamap library returns false
- This workaround detects RDMA for both Ethernet (RoCE/iWARP) and
  InfiniBand interfaces correctly
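
For readers who want a feel for the sysfs check, here is a rough sketch. It assumes the helper looks for entries under /sys/class/net/{IFNAME}/device/infiniband, as described in the review below. The name hasRDMADeviceIn, its sysfsRoot parameter, and the demo function are hypothetical. The actual hasRDMADeviceInSysfs() in pkg/inventory/db.go may differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hasRDMADeviceIn reports whether
// <sysfsRoot>/class/net/<ifName>/device/infiniband exists and has at
// least one entry. sysfsRoot is a parameter only so the check can be
// exercised against a fake tree; on a real node it would be "/sys".
func hasRDMADeviceIn(sysfsRoot, ifName string) bool {
	dir := filepath.Join(sysfsRoot, "class", "net", ifName, "device", "infiniband")
	entries, err := os.ReadDir(dir)
	if err != nil {
		// A missing or unreadable directory is treated as "no RDMA device".
		return false
	}
	return len(entries) > 0
}

// demo builds a throwaway directory tree that mimics sysfs and runs the
// check for one interface with an RDMA device and one without.
func demo() (withRDMA, withoutRDMA bool, err error) {
	root, err := os.MkdirTemp("", "fake-sysfs")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(root)

	netDir := filepath.Join(root, "class", "net")
	// ib0 gets a fake RDMA device directory, eth0 gets none.
	if err := os.MkdirAll(filepath.Join(netDir, "ib0", "device", "infiniband", "mlx5_0"), 0o755); err != nil {
		return false, false, err
	}
	if err := os.MkdirAll(filepath.Join(netDir, "eth0", "device"), 0o755); err != nil {
		return false, false, err
	}
	return hasRDMADeviceIn(root, "ib0"), hasRDMADeviceIn(root, "eth0"), nil
}

func main() {
	withRDMA, withoutRDMA, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("ib0:", withRDMA, "eth0:", withoutRDMA)
}
```

Note that this check does not compare GUIDs at all, so it is unaffected by the node GUID versus port GUID mismatch.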

The fix maintains backward compatibility by trying rdmamap first,
then falling back to sysfs only when needed.
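
The ordering can be sketched as below. This is an illustration only: the detector type, detectRDMA, and both stub functions are hypothetical stand-ins, not the real discoverRDMADevices() or the rdmamap API.

```go
package main

import "fmt"

// detector reports whether the named interface is RDMA-capable.
type detector func(ifName string) bool

// detectRDMA asks the primary detector first and consults the fallback
// only when the primary says no, so a positive primary answer is never
// overridden.
func detectRDMA(ifName string, primary, fallback detector) bool {
	if primary(ifName) {
		return true
	}
	return fallback(ifName)
}

func main() {
	// Stub standing in for the library lookup: it misses IPoIB interfaces.
	libraryStub := func(ifName string) bool { return ifName == "eth0" }
	// Stub standing in for the sysfs check: it sees both interfaces.
	sysfsStub := func(ifName string) bool { return ifName == "eth0" || ifName == "ib0" }

	for _, name := range []string{"eth0", "ib0", "lo"} {
		fmt.Printf("%s rdma=%v\n", name, detectRDMA(name, libraryStub, sysfsStub))
	}
}
```

Because the fallback runs only after a negative primary result, a fixed upstream library would simply make the fallback path unused.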

Fixes: dra.net/rdma incorrectly showing false for InfiniBand devices

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 11, 2025
@k8s-ci-robot
Contributor

Welcome @anson627!

It looks like this is your first PR to kubernetes-sigs/dranet 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/dranet has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 11, 2025
@gauravkghildiyal
Member

Thanks for contributing @anson627 !

Am I understanding correctly that once the upstream fix for Mellanox/rdmamap#15 lands, we won't need a separate fix within DRANET? If yes, and assuming we can wait, we can try to get this upstreamed and then bump the dependency to the fixed version.

@anson627
Contributor Author

Sure, I opened Mellanox/rdmamap#16 and will try to get a review from the Mellanox upstream maintainers.

@MikeZappa87
Contributor

@anson627 I don't see that the rdmamap issue has moved. Is this PR still required?

@anson627
Contributor Author

@MikeZappa87 I may need some help getting a review from NVIDIA on Mellanox/rdmamap#16.

@MikeZappa87
Contributor

I reached out to him on Slack to see if he can review this.

@MikeZappa87
Contributor

@gauravkghildiyal this seems stuck between a rock and a hard place. @anson627 if this PR merges and then the PR you have in the other repo merges, nothing breaks, correct?

@anson627
Contributor Author

Yes, the current logic is only a fallback. I've tested it with GB200 nodes on Azure.

@gauravkghildiyal
Member

@MikeZappa87 -- If this is a hard blocker for you, I think it's okay to merge this one. As you described, the change is backward compatible and implemented as a fallback. My earlier intention was to avoid any potential difference between the implementation in this PR (which checks for the existence of a directory within /sys/class/net/{IFNAME}/device/infiniband) and the upstream implementation, which also matches the port's GUID.

In the interest of making some progress, I consider the implementation here to be "safe enough".

(Feel free to LGTM as you find appropriate)

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 16, 2026
Comment thread pkg/inventory/db.go Outdated
@gauravkghildiyal
Member

/assign
/assign @MikeZappa87

@MikeZappa87
Contributor

My preference would be to have the PR in the other repo merged; however, it looks like that repo's last PR was over two years ago. I attempted to reach out to a couple of maintainers but got no response. @anson627's fix might be the permanent one. I will try to get the other one merged, but I will approve this for now.

@MikeZappa87
Contributor

/approve

@aojea
Contributor

aojea commented Jan 19, 2026

/lgtm
/approve

Thanks

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 19, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: anson627, aojea, gauravkghildiyal, MikeZappa87

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [aojea,gauravkghildiyal]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit bc67047 into kubernetes-sigs:main Jan 19, 2026
7 of 9 checks passed
@anson627 anson627 deleted the fix-rdma-detection-infiniband branch February 25, 2026 19:26


5 participants