RTX 5070 (GB205): __nv_drm_gem_nvkms_map composes a mapping that spans BAR1→BAR3, causing mapping_reuse.c:273 NV_ERR_NO_MEMORY and krcWatchdog GPU lock — driver 595.71.05 (open kernel modules), Resizable BAR disabled #1132

@BadPackage

NVIDIA Open GPU Kernel Modules Version

595.71.05 (Open Kernel Modules, Release Build, built 2026-04-24, builder dvs-builder@U22-I3-G08-03-1)

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Fedora Linux 44 (Workstation Edition)

Kernel Release

6.19.14-300.fc44.x86_64 - by Fedora

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 5070 (GB205, PCI ID 10de:2f04 rev a1, ASUSTeK subsystem 1043:89e6).

Describe the bug

After about 47 hours of normal Wayland desktop use with many GPU-accelerated clients, the NVIDIA driver attempts a mapping whose computed range crosses the boundary between PCI BAR1 and BAR3. In the kernel log the sequence is: dmaAllocMapping_GM107 fails to allocate VA space and NV_ERR_NO_MEMORY is returned at mapping_reuse.c:273; the kernel's "resource sanity check" flags a request from __nv_drm_gem_nvkms_map that spans more than BAR1; nvidia-drm then fails to initialize a plane fence semaphore and the atomic modeset fails with error code -11; finally the RC watchdog declares the GPU "probably locked" and keeps firing about every 8 seconds. The display session becomes unrecoverable: restarting gdm is the lightest path back, and a full reboot is sometimes required. At the moment of the first failure only ~1.7 GiB of 12 GiB VRAM was in use, so this is not VRAM exhaustion in bytes; it appears to be exhaustion or fragmentation of the BAR1 mapping window.

Relevant hardware state

PCI BAR layout for the GPU (lspci -vv -s 08:00.0):

Region 0: Memory at f8000000 (32-bit, non-prefetchable) [size=64M]
Region 1: Memory at d0000000 (64-bit, prefetchable)     [size=256M]   ← BAR1
Region 3: Memory at e0000000 (64-bit, prefetchable)     [size=32M]    ← BAR3
Capabilities: [134 v1] Physical Resizable BAR
Capabilities: [140 v1] Virtual Resizable BAR

The GPU advertises both Physical and Virtual Resizable BAR capabilities, but the system has Resizable BAR disabled. Per /proc/driver/nvidia/params:

EnableResizableBar: 0

So BAR1 stays at 256 MiB rather than being resized to span the full 12 GiB of VRAM. This appears to be the predisposing condition for the failure.
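As a cross-check of the layout above, here is a minimal sketch that parses the Region lines from the lspci excerpt and prints each BAR's address range. The regular expression and the helper name parse_regions are my own illustration, not part of any NVIDIA or pciutils tooling, and the sample text is the excerpt from this report with the arrow annotations removed.

```python
import re

# Region lines copied from the lspci -vv excerpt in this report.
LSPCI_SAMPLE = """\
Region 0: Memory at f8000000 (32-bit, non-prefetchable) [size=64M]
Region 1: Memory at d0000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at e0000000 (64-bit, prefetchable) [size=32M]
"""

REGION_RE = re.compile(
    r"Region (\d+): Memory at ([0-9a-f]+) \(.*?\) \[size=(\d+)([KMG])\]"
)
UNIT = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_regions(text):
    """Return {region_index: (start, end_inclusive)} for each memory BAR."""
    regions = {}
    for match in REGION_RE.finditer(text):
        index = int(match.group(1))
        start = int(match.group(2), 16)
        size = int(match.group(3)) * UNIT[match.group(4)]
        regions[index] = (start, start + size - 1)
    return regions


regions = parse_regions(LSPCI_SAMPLE)
for index, (start, end) in sorted(regions.items()):
    size_mib = (end - start + 1) >> 20
    print(f"BAR{index}: {start:#010x}-{end:#010x} ({size_mib} MiB)")

# BAR3 begins at the address immediately after the end of BAR1.
bars_are_adjacent = regions[1][1] + 1 == regions[3][0]
print("BAR1 and BAR3 adjacent:", bars_are_adjacent)
```

On this system BAR1 and BAR3 are contiguous in physical address space, so a mapping that runs past the end of BAR1 lands directly in BAR3 rather than in an unmapped hole.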

Kernel log of the failure

Most informative dmesg excerpt (kernel timestamps, single uptime, in order):

[168814.845214] NVRM: dmaAllocMapping_GM107: can't alloc VA space for mapping.
[168814.845222] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051)
                returned from pReuseMappingDb->pMapCb(pReuseMappingDb->pGlobalCtx, pAllocCtx, range,
                cachingFlags, &token, _reusemappingdbAddMappingCallback) @ mapping_reuse.c:273
[168814.845231] NVRM: dmaAllocMapping_GM107: can't alloc VA space for mapping.
[168814.845313] NVRM: dmaAllocMapping_GM107: can't alloc VA space for mapping.
[168814.845434] resource: resource sanity check: requesting [mem 0x00000000df550000-0x00000000e013ffff],
                which spans more than 0000:08:00.0 [mem 0xd0000000-0xdfffffff 64bit pref]
[168814.845438] caller __nv_drm_gem_nvkms_map+0x99/0xf0 [nvidia_drm] mapping multiple BARs
[168821.090046] [drm:__nv_drm_convert_in_fences [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800]
                Failed to initialize semaphore for plane fence
[168821.090058] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800]
                Failed to apply atomic modeset.  Error code: -11
[168824.609543] NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked!  Notify Timeout Seconds: 7
[168832.801431] NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked!  Notify Timeout Seconds: 7
... (krcWatchdog_IMPL repeats every ~8 s indefinitely) ...
[169656.854308] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800]
                Flip event timeout on head 3
... (kernel falls back to fbcon shortly after) ...

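The driver requested 0xdf550000–0xe013ffff (≈ 12 MiB). That range starts inside BAR1 (0xd0000000–0xdfffffff, 256 MiB) and ends inside BAR3 (0xe0000000–0xe1ffffff, 32 MiB), so the kernel's resource sanity check flags it as spanning two distinct BARs. The -11 (EAGAIN) error code in the log is the one nv_drm_atomic_commit reports about six seconds later, after the plane fence semaphore fails to initialize.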
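The arithmetic behind that range can be checked with a short sketch. The addresses come from the dmesg and lspci excerpts in this report; the overlap helper is only an illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bytes shared by two inclusive address ranges."""
    lo = max(a_start, b_start)
    hi = min(a_end, b_end)
    return max(0, hi - lo + 1)


MIB = 1 << 20

REQ = (0xDF550000, 0xE013FFFF)   # range from the resource sanity check
BAR1 = (0xD0000000, 0xDFFFFFFF)  # Region 1, 256 MiB
BAR3 = (0xE0000000, 0xE1FFFFFF)  # Region 3, 32 MiB

req_size = REQ[1] - REQ[0] + 1
in_bar1 = overlap(*REQ, *BAR1)
in_bar3 = overlap(*REQ, *BAR3)
offset_in_bar1 = REQ[0] - BAR1[0]

print(f"request size:         {req_size / MIB} MiB")
print(f"offset into BAR1:     {offset_in_bar1 / MIB} MiB")
print(f"bytes inside BAR1:    {in_bar1 / MIB} MiB")
print(f"bytes spilling to BAR3: {in_bar3 / MIB} MiB")
```

The request is 11.9375 MiB long and starts 245.3125 MiB into the 256 MiB window, so 10.6875 MiB fits in BAR1 and the remaining 1.25 MiB falls in BAR3.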

Suspected root cause

Likely predisposing factor: with ReBAR disabled, BAR1 stays at 256 MiB even though GB205 has 12 GiB of VRAM and advertises the ReBAR capability. Under sustained mapping-reuse churn (pReuseMappingDb, mapping_reuse.c:273), the driver computed a mapping that crossed the BAR1→BAR3 boundary, as recorded by the kernel sanity check. The driver does not appear to recover cleanly from the failed allocation: NV_ERR_NO_MEMORY is returned upward and the GPU is left in a state the RC watchdog cannot recover from.

Concurrent system state at first failure

$ nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free,driver_version --format=csv
NVIDIA GeForce RTX 5070, 12227 MiB, 1710 MiB, 10062 MiB, 595.71.05
$ uptime    # at SSH login, after the lockup was already in progress
 23:46:58 up 1 day, 23:01,  3 users,  load average: 2.60, 1.92, 1.32
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        27Gi       3.2Gi       1.8Gi        35Gi        35Gi
Swap:          8.0Gi        71Mi       7.9Gi

Host CPU and system RAM are not under pressure. VRAM is mostly free. Failure is GPU-side only.
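To put numbers on "VRAM is mostly free", this sketch parses the nvidia-smi CSV row above and compares VRAM in use with the BAR1 window size from the lspci output. The parse_mib helper is my own illustration. The comparison with BAR1 is context only: how much of the in-use VRAM actually needs a BAR1 mapping is not known from this report.

```python
import csv
import io

# CSV row copied from the nvidia-smi output in this report.
SMI_CSV = "NVIDIA GeForce RTX 5070, 12227 MiB, 1710 MiB, 10062 MiB, 595.71.05\n"
BAR1_MIB = 256  # Region 1 size from lspci


def parse_mib(field):
    """Convert a field such as '1710 MiB' to an integer number of MiB."""
    value, unit = field.split()
    if unit != "MiB":
        raise ValueError(f"unexpected unit: {unit}")
    return int(value)


row = next(csv.reader(io.StringIO(SMI_CSV), skipinitialspace=True))
name, total, used, free, driver = row
total_mib, used_mib, free_mib = (parse_mib(f) for f in (total, used, free))

used_fraction = used_mib / total_mib
bar1_fraction = BAR1_MIB / total_mib

print(f"{name} (driver {driver})")
print(f"VRAM in use:          {used_fraction:.1%} of {total_mib} MiB")
print(f"BAR1 window vs. VRAM: {bar1_fraction:.1%}")
```

About 14% of VRAM was in use, while the 256 MiB BAR1 window can expose only about 2% of the card's memory at a time.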

Kernel command line

BOOT_IMAGE=(hd4,gpt4)/vmlinuz-6.19.14-300.fc44.x86_64 root=UUID=... ro rootflags=subvol=root
rd.luks.uuid=luks-... rhgb quiet nvidia-drm.modeset=1 snd_hda_intel.power_save=0
rd.driver.blacklist=nouveau,nova_core modprobe.blacklist=nouveau,nova_core

nouveau and nova_core are blacklisted. nvidia-drm.modeset=1 is set.

Compositor / userspace

GNOME Shell on Wayland (gdm), stock Fedora xorg-x11-drv-nvidia userspace at 595.71.05.

To Reproduce

Difficult to reproduce on demand. Observed once after very long uptime:

  1. Boot Fedora 44, GNOME on Wayland, NVIDIA open kernel modules 595.71.05 (nvidia-drm.modeset=1, ReBAR disabled in BIOS).
  2. Use the desktop normally over ~47 hours with a busy mix of GPU-accelerated clients:
    • Brave Browser (~15 windows, ~20 renderer/utility processes)
    • Discord (Electron)
    • Spotify (Electron)
    • Steam + steamwebhelper
    • RustRover (JetBrains)
    • Xwayland hosting several X11 clients
    • GNOME Shell + extensions
  3. After roughly that uptime, dmaAllocMapping failures begin appearing, immediately followed by the BAR-spanning sanity-check warning and krcWatchdog.
  4. The desktop session becomes unresponsive. SSH still works, no Xid is logged, and GSP appears healthy (no Xid 119); the kernel keeps logging the watchdog message every ~8 s until gdm is restarted or the box is rebooted.
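Since the failure is hard to trigger on demand, a small log scanner may help catch the next occurrence. The sketch below looks for three message signatures taken from the dmesg excerpt in this report and measures the spacing between watchdog messages. The signature names and the scan and watchdog_intervals helpers are hypothetical, not part of any NVIDIA tool, and the sample is a subset of the log above.

```python
import re

# Subset of the dmesg excerpt from this report.
LOG_SAMPLE = """\
[168814.845214] NVRM: dmaAllocMapping_GM107: can't alloc VA space for mapping.
[168814.845434] resource: resource sanity check: requesting [mem 0x00000000df550000-0x00000000e013ffff],
[168814.845438] caller __nv_drm_gem_nvkms_map+0x99/0xf0 [nvidia_drm] mapping multiple BARs
[168824.609543] NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked!  Notify Timeout Seconds: 7
[168832.801431] NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked!  Notify Timeout Seconds: 7
"""

# Hypothetical signature names mapped to substrings seen in the log.
SIGNATURES = {
    "va_alloc_failed": "dmaAllocMapping_GM107: can't alloc VA space",
    "bar_span": "mapping multiple BARs",
    "watchdog": "RC watchdog: GPU is probably locked",
}
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def scan(log_text):
    """Return a list of (timestamp, signature_name) for matching lines."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # continuation lines carry no timestamp
        stamp, message = float(match.group(1)), match.group(2)
        for name, needle in SIGNATURES.items():
            if needle in message:
                events.append((stamp, name))
    return events


def watchdog_intervals(events):
    """Seconds between consecutive watchdog messages, rounded to ms."""
    stamps = [stamp for stamp, name in events if name == "watchdog"]
    return [round(b - a, 3) for a, b in zip(stamps, stamps[1:])]


events = scan(LOG_SAMPLE)
for stamp, name in events:
    print(f"{stamp:.6f} {name}")
print("watchdog intervals (s):", watchdog_intervals(events))
```

The same scan function could be fed text captured from dmesg or the journal; the first va_alloc_failed or bar_span event would mark the point at which to collect nvidia-smi and lspci state.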

Bug Incidence

Sometimes

nvidia-bug-report.log.gz

More Info

GitHub-issue searches that returned no existing match:

  • dmaAllocMapping
  • mapping_reuse
  • krcWatchdog GPU is probably locked
  • NV_ERR_NO_MEMORY VA
  • GB205
  • __nv_drm_gem_nvkms_map mapping multiple BARs

Function-name note

The log says dmaAllocMapping_GM107 despite the GPU being Blackwell (GB205). Appears to be a legacy-named symbol still in use across HALs — flagged in case it is informative.

Workaround being tried

Will enable Resizable BAR in BIOS and re-test. If that prevents recurrence, this points strongly at the BAR1-size / mapping-reuse interaction described above. Will update this issue with the result either way.
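One way to confirm that the BIOS change took effect before re-testing is to read the BAR sizes from sysfs and check that BAR1 is at least as large as VRAM. This is a sketch under assumptions: it relies on the standard sysfs resource file format (one line per resource with start, end and flags in hex), the sample text is built from the addresses in this report with the flags column zeroed as a placeholder, and the helper names are my own.

```python
from pathlib import Path

# Illustrative sample: addresses from this report, flags zeroed (placeholder).
SAMPLE_RESOURCE = """\
0x00000000f8000000 0x00000000fbffffff 0x0000000000000000
0x00000000d0000000 0x00000000dfffffff 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000e0000000 0x00000000e1ffffff 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
"""
VRAM_MIB = 12227  # memory.total from nvidia-smi in this report


def read_bar_sizes(resource_text):
    """Map BAR index -> size in bytes; all-zero entries are skipped."""
    sizes = {}
    for index, line in enumerate(resource_text.splitlines()[:6]):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end = int(fields[0], 16), int(fields[1], 16)
        if end > start:
            sizes[index] = end - start + 1
    return sizes


def bar1_covers_vram(sizes, vram_mib):
    """True when BAR1 is at least as large as the reported VRAM."""
    return sizes.get(1, 0) >= (vram_mib << 20)


# Use the live sysfs file when present, otherwise the sample above.
path = Path("/sys/bus/pci/devices/0000:08:00.0/resource")
text = path.read_text() if path.exists() else SAMPLE_RESOURCE
live_sizes = read_bar_sizes(text)
print("BAR1 MiB:", live_sizes.get(1, 0) >> 20)
print("BAR1 covers VRAM:", bar1_covers_vram(live_sizes, VRAM_MIB))
```

With the layout reported here the check returns False (256 MiB against 12227 MiB); once ReBAR is active, BAR1 should be large enough for it to return True.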
