RTX 3090 Ti (Ampere): suspend hangs on Nth S3 cycle — virtualBar2[GPU_GFID_PF].pCpuMapping == NULL on resume after per-cycle NVLink unilateral-shutdown accumulation — kernel 7.0.3, nvidia-open 595.71.05 #1148

@BenStringer3

Description

NVIDIA Open GPU Kernel Modules Version

595.71.05 (Arch package nvidia-open-dkms 595.71.05-2)

Please confirm this issue does not happen with the proprietary driver (of the same version).

  • I confirm that this does not happen with the proprietary driver package.

Arch Linux retired the proprietary nvidia-dkms package in the 590.x series transition;
the proprietary kernel modules for 595.x are not available via the package manager and
would require NVIDIA's .run installer. This test has not been performed. I am willing
to do so if maintainers consider it necessary to isolate the regression.

Operating System and Version

Arch Linux (rolling release)

Kernel Release

Linux 7.0.3-arch1-2 #1 SMP PREEMPT_DYNAMIC Fri, 01 May 2026 15:49:22 +0000 x86_64

Hardware: GPU

NVIDIA GeForce RTX 3090 Ti (Ampere GA102, PCI 0000:08:00.0)

Single-GPU desktop. No NVLink peers. AMD Raphael iGPU present but blacklisted — confirmed
not loaded during any of the three failed sessions.

Describe the bug

S3 (deep) suspend/resume fails intermittently. In each affected boot session, N
suspend/resume cycles succeed and the next resume hangs, ending the session. Across three
sessions: N = 11, 6, and 12. Sleep duration does not predict the hang: 24-hour and
15-hour sleeps succeeded in sessions that later failed.

The third failure (2026-05-17) produced visible host-driver assertion failures showing
virtualBar2[GPU_GFID_PF].pCpuMapping is NULL at resume. The first two failures
(2026-05-07, 2026-05-10) were completely silent — no assertions, no panic, no SSH
response after wake.

In all three sessions, every S3 suspend logged exactly one instance of:

NVRM: knvlinkCoreShutdownDeviceLinks_IMPL: Need to shutdown all links unilaterally for GPU0

This message appears one-to-one with each PM: suspend entry (deep). The RTX 3090 Ti has
no NVLink peers, so the driver takes the unilateral fallback on every cycle. We believe
repeated traversal of this path accumulates state corruption in the GMMU VA allocator that
eventually prevents virtualBar2[GPU_GFID_PF].pCpuMapping from being restored on resume.
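
The one-to-one correspondence described above can be checked mechanically against a saved
kernel log. The following is a minimal Python sketch, not part of any driver tooling: the
function name summarize_cycles and the keys it returns are made up for illustration, and
it assumes the log has already been exported as plain text, one message per line.

```python
from collections import Counter

# Substrings taken from the log messages quoted in this report.
SUSPEND_ENTRY = "PM: suspend entry (deep)"
SUSPEND_EXIT = "PM: suspend exit"
NVLINK_UNILATERAL = (
    "knvlinkCoreShutdownDeviceLinks_IMPL: "
    "Need to shutdown all links unilaterally"
)


def summarize_cycles(log_lines):
    """Count suspend entries, suspend exits, and NVLink unilateral-shutdown
    messages in one boot session's kernel log (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        if SUSPEND_ENTRY in line:
            counts["entries"] += 1
        elif SUSPEND_EXIT in line:
            counts["exits"] += 1
        elif NVLINK_UNILATERAL in line:
            counts["nvlink"] += 1
    return {
        "entries": counts["entries"],
        "exits": counts["exits"],
        "nvlink_unilateral": counts["nvlink"],
        # True when every suspend entry has exactly one matching NVRM message.
        "one_to_one": counts["entries"] == counts["nvlink"],
        # True when the last suspend entry has no matching exit, which is the
        # pattern this report describes for the hung resume.
        "hung_on_last_resume": counts["entries"] == counts["exits"] + 1,
    }
```

One way to feed it would be to save the previous boot's kernel messages with
journalctl -k -b -1 -o cat and pass the lines of that file to summarize_cycles.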

Failure signature (third incident — only time assertions were visible)

NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: NULL != pIter->pMap @ virt_mem_allocator_gm107.c:2024
NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: progress == entryIndexHi - entryIndexLo + 1 @ mmu_walk_map.c:170
NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: NV_OK == status @ mmu_walk.c:541
NVRM: GPU0 mmuWalkMap: Failed to map VA Range 0x86000000 to 0x865fffff. Status = 0x00000040
NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: 0 @ mmu_walk_map.c:75
NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: pEntries != NULL @ gmmu_walk.c:826
NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: 0 @ mmu_walk.c:391
NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: pEntries != NULL @ gmmu_walk.c:826
NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: progress == 1 @ mmu_walk.c:1522
NVRM: GPU0 mmuWalkUnmap: Failed to unmap VA Range 0x86000000 to 0x865fffff. Status = 0x00000040
NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: 0 @ mmu_walk_unmap.c:62
NVRM: GPU0 mmuWalkMap: Unmap failed with status = 0x00000040
NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: NV_OK == unmapStatus @ mmu_walk_map.c:84
NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: NV_OK == status @ gpu_vaspace.c:2036
NVRM: GPU0 nvCheckFailedNoLog: Check failed: NV_OK == status @ virt_mem_allocator_gm107.c:2552
NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: (pKernelBus->pReadToFlush != NULL ||
  pKernelBus->virtualBar2[GPU_GFID_PF].pCpuMapping != NULL) @ kern_bus_gv100.c:388

The final assertion is the root observable: virtualBar2[GPU_GFID_PF].pCpuMapping is
NULL on resume. This is the host CPU's mapping into PCIe BAR2, used for all internal GPU
memory access. Without it, every GMMU walk fails in a cascade. PM: suspend exit was
never logged — the resume never completed.
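
To compare this cascade against other reports, it may help to reduce it to a list of
assertion sites and the size of the VA range that failed to map. The sketch below does
that in Python; the helper names assertion_sites and va_range_size are hypothetical, and
the regular expressions only assume the message layout visible in the signature above.

```python
import re

# Matches "Assertion failed: <expr> @ <file>:<line>" and the "Check failed"
# variant. DOTALL lets <expr> span a wrapped line, as in the last entry.
ASSERT_RE = re.compile(
    r"(?:Assertion|Check) failed: (?P<expr>.*?) @ (?P<file>[\w.]+):(?P<line>\d+)",
    re.DOTALL,
)
RANGE_RE = re.compile(r"VA Range (0x[0-9a-fA-F]+) to (0x[0-9a-fA-F]+)")


def assertion_sites(log_text):
    """Return (file, line, expression) for each assertion, in log order."""
    sites = []
    for m in ASSERT_RE.finditer(log_text):
        expr = " ".join(m["expr"].split())  # collapse wrapped whitespace
        sites.append((m["file"], int(m["line"]), expr))
    return sites


def va_range_size(log_text):
    """Return the size in bytes of the first 'VA Range lo to hi' (inclusive),
    or None if no such message is present."""
    m = RANGE_RE.search(log_text)
    if m is None:
        return None
    lo, hi = (int(x, 16) for x in m.groups())
    return hi - lo + 1
```

For the range in this report, 0x865fffff - 0x86000000 + 1 = 0x600000 bytes, which is
6 MiB.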

What has been ruled out

Hypothesis

Each S3 suspend cycle traverses the NVLink unilateral-shutdown path
(knvlinkCoreShutdownDeviceLinks_IMPL), leaking or incompletely cleaning up VA allocator
state in the GMMU (virt_mem_allocator_gm107.c, mmu_walk.c). After N cycles, the VA
space used to establish virtualBar2[GPU_GFID_PF].pCpuMapping is exhausted or corrupted.
The next resume finds pCpuMapping == NULL, fails every GMMU walk, and freezes.

Variable N (6, 11, 12) is consistent with a per-cycle leak whose accumulation depends on
the driver's initial VA space state at boot.

Relation to issue #1134

Issue #1134 reports the same kernel (7.0.3-arch1-2), driver (595.71.05), GPU family
(RTX 3090), and OS (Arch Linux, Hyprland). That reporter observed
dmaAllocMapping_GM107: can't alloc VA space for mapping and NV_ERR_NO_MEMORY (0x51)
from virt_mem_allocator_gm107.c → Xid 31 → Xid 154, triggered by a Chromium renderer
process. The VA allocator is shared; the trigger path (DRM GEM ops vs. S3 PM callbacks
via NVLink shutdown) and depleted resource (BAR1 userspace window vs. BAR2 internal
aperture) differ. These may be distinct bugs in the same allocator, or the same leak via
different trigger paths.

Bug Incidence

Three occurrences across three boot sessions. No deterministic per-step reproducer —
accumulation takes multiple suspend cycles within a single kernel session.
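
Since the failure needs several cycles in one session, a scripted suspend loop might
shorten the time to reproduce. The Python sketch below is an untested idea, not a
confirmed reproducer: whether short, back-to-back cycles trigger the same hang is
unknown. It assumes rtcwake is installed, the script runs as root, and
/sys/power/mem_sleep selects deep. The names record_cycle and run_cycles are
hypothetical. The cycle number is written and fsynced before each suspend so that N
survives a hard hang.

```python
import os
import subprocess


def record_cycle(path, n):
    """Append the cycle number and force it to disk before suspending."""
    with open(path, "a") as f:
        f.write(f"{n}\n")
        f.flush()
        os.fsync(f.fileno())


def run_cycles(max_cycles, sleep_seconds, counter_path, suspend=None):
    """Suspend and resume up to max_cycles times; return cycles completed.

    `suspend` is injectable so the loop can be exercised without actually
    suspending. By default it calls rtcwake, which suspends to RAM and
    arms the RTC alarm to wake the machine after `sleep_seconds`.
    """
    if suspend is None:
        def suspend(seconds):
            subprocess.run(
                ["rtcwake", "-m", "mem", "-s", str(seconds)], check=True
            )
    completed = 0
    for n in range(1, max_cycles + 1):
        record_cycle(counter_path, n)  # persisted even if this resume hangs
        suspend(sleep_seconds)
        completed = n
    return completed
```

If the machine hangs on resume, the last number in the counter file is the cycle that
failed, which gives N + 1 in the terms used above.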
