RTX PRO 6000 Blackwell: Xid 8 / GSP watchdog timeout under sustained SGLang FP8 inference #1159

@HH1162

Summary

The RTX PRO 6000 Blackwell Workstation Edition crashes under sustained LLM inference when SGLang serves an FP8-quantized model. The GPU enters an unrecoverable state that requires a full system reboot.

System Information

  • GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96GB)
  • Driver: nvidia-open 595.71.05 (Ubuntu package)
  • OS: Ubuntu 24.04.4 LTS
  • Kernel: 6.17.0-29-generic
  • Workload: SGLang serving Qwen3 FP8 model (--max-running-requests 16)

Crash Signature

NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked!  Notify Timeout Seconds: 7
NVRM: Xid (PCI:0000:01:00): 8, pid=48694, name=python3, channel 0x00000008
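
For anyone triaging similar reports, here is a minimal Python sketch that pulls the fields out of the two kernel log lines above. The regular expressions are written against the exact layout quoted in this issue and are an assumption beyond that: other driver versions or Xid codes may format these lines differently. The helper names `parse_xid` and `parse_watchdog_timeout` are hypothetical.

```python
import re
from typing import Optional

# Matches only the Xid line layout quoted in this report (assumption:
# other drivers or Xid codes may print additional or different fields).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+), "
    r"channel (?P<channel>0x[0-9a-fA-F]+)"
)

# Matches the RC watchdog line and captures the notify timeout in seconds.
WATCHDOG_RE = re.compile(
    r"RC watchdog: .*Notify Timeout Seconds: (?P<secs>\d+)"
)


def parse_xid(line: str) -> Optional[dict]:
    """Return the Xid fields from one kernel log line, or None if no match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return {
        "pci": m.group("pci"),
        "xid": int(m.group("xid")),
        "pid": int(m.group("pid")),
        "name": m.group("name").strip(),
        "channel": int(m.group("channel"), 16),  # hex string to int
    }


def parse_watchdog_timeout(line: str) -> Optional[int]:
    """Return the watchdog notify timeout in seconds, or None if no match."""
    m = WATCHDOG_RE.search(line)
    return int(m.group("secs")) if m else None
```

Feeding it lines from `dmesg` or `journalctl -k` makes it easy to count Xid 8 events per process across several crash logs.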

What Was Tried

  • Reduced --max-running-requests from 16 to 8 (pending test)
  • Power limit is already capped at 480 W (default 600 W), so the card is not hitting the power wall
  • Temperature was normal at crash time (no thermal throttling in logs)
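
The power and temperature observations above could be backed by a sampled log captured up to the moment of the crash. The following is a rough Python sketch, not something that was run for this report: it builds an `nvidia-smi --query-gpu` command and parses each CSV row it prints. The choice of fields, the one-second interval, and the helper names `build_query_command` and `parse_sample` are assumptions made for illustration.

```python
from typing import List, Optional

# Fields requested from nvidia-smi; the order must match parse_sample below.
QUERY_FIELDS = ["power.draw", "power.limit", "temperature.gpu"]


def build_query_command(interval_s: int = 1) -> List[str]:
    """Build an nvidia-smi command that prints one CSV row per interval."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
        "-l",
        str(interval_s),
    ]


def parse_sample(line: str) -> Optional[dict]:
    """Parse one CSV row; return None for malformed or non-numeric rows."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != len(QUERY_FIELDS):
        return None
    try:
        draw, limit, temp = (float(p) for p in parts)
    except ValueError:
        # nvidia-smi prints placeholders such as "[N/A]" for unsupported fields.
        return None
    return {
        "power_draw_w": draw,
        "power_limit_w": limit,
        "temperature_c": temp,
        "headroom_w": limit - draw,  # distance from the configured power cap
    }
```

The command list can be passed to `subprocess.Popen` with `stdout=subprocess.PIPE` and each output line handed to `parse_sample`. Writing the rows to disk as they arrive matters here, because the machine needs a reboot after the crash.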

Related Issues

This appears to be the same GSP firmware halt class as:

The firmware reverse-engineering (RE) analysis in #1080 confirms that the root cause is a missing GPU reset recovery path for Blackwell in the kernel driver.

Expected Behavior

The GPU should remain stable under sustained LLM inference workloads or, at minimum, have a working recovery path that does not require a full system reboot.
