RTX PRO 6000 Blackwell: Xid 8 / GSP watchdog timeout under sustained SGLang FP8 inference #1159

## Summary

An RTX PRO 6000 Blackwell Workstation Edition crashes under sustained LLM inference when serving an FP8-quantized model with SGLang. The GPU enters an unrecoverable state that requires a full system reboot.

## System Information

- **GPU**: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96GB)
- **Driver**: nvidia-open 595.71.05 (Ubuntu package)
- **OS**: Ubuntu 24.04.4 LTS
- **Kernel**: 6.17.0-29-generic
- **Workload**: SGLang serving a Qwen3 FP8 model (`--max-running-requests 16`)

## Crash Signature

```
NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked!  Notify Timeout Seconds: 7
NVRM: Xid (PCI:0000:01:00): 8, pid=48694, name=python3, channel 0x00000008
```
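To make it easier to compare this signature against other reports, the kernel log lines can be reduced to structured fields. The following is a minimal sketch, not part of the driver or any NVIDIA tool: the function name `parse_nvrm_events` and the regular expressions are assumptions derived only from the two log lines quoted above, so other Xid message layouts may not match.

```python
import re

# Pattern assumed from the single Xid line in this report; other Xid
# messages may carry a different tail after the Xid number.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)
WATCHDOG_RE = re.compile(r"NVRM: krcWatchdog_IMPL: RC watchdog")
KEY_VALUE_RE = re.compile(r"(\w+)=([^,\s]+)")

# The two lines from the crash signature, used as sample input.
SAMPLE = (
    "NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked!  "
    "Notify Timeout Seconds: 7\n"
    "NVRM: Xid (PCI:0000:01:00): 8, pid=48694, name=python3, channel 0x00000008\n"
)


def parse_nvrm_events(log_text):
    """Return a list of dicts for RC watchdog and Xid lines, in log order."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            # Collect key=value pairs such as pid=... and name=...
            fields = dict(KEY_VALUE_RE.findall(match.group("detail")))
            events.append({
                "kind": "xid",
                "pci": match.group("pci"),
                "xid": int(match.group("xid")),
                "pid": int(fields["pid"]) if "pid" in fields else None,
                "name": fields.get("name"),
            })
        elif WATCHDOG_RE.search(line):
            events.append({"kind": "rc_watchdog"})
    return events
```

Calling `parse_nvrm_events(SAMPLE)` yields one `rc_watchdog` event followed by one `xid` event; in practice the input would be the text of `dmesg` or `journalctl -k` from the boot in which the crash occurred.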

## What Was Tried

- Reduced `--max-running-requests` from 16 to 8 (pending test)
- Power limit is already capped at 480 W (default 600 W), so the card is not hitting the power wall
- Temperature was normal at crash time (no thermal throttling in logs)
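To back the power and temperature observations with data from the moments just before a hang, telemetry can be logged to disk at a fixed interval. This is a rough sketch under assumptions: the helper names (`build_query_cmd`, `parse_sample`, `log_until_interrupted`), the output path, the one-second interval, and GPU index 0 are all illustrative choices and were not used to produce this report. It relies only on the `nvidia-smi --query-gpu` CSV interface.

```python
import os
import subprocess
import time

# Query fields passed to nvidia-smi; the selection is an assumption.
FIELDS = [
    "timestamp",
    "power.draw",
    "power.limit",
    "temperature.gpu",
    "clocks.sm",
    "utilization.gpu",
]


def build_query_cmd(gpu_index=0):
    """Build the nvidia-smi command line for one CSV sample."""
    return [
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]


def parse_sample(csv_line):
    """Parse one CSV line into a dict; non-numeric values become None."""
    values = [v.strip() for v in csv_line.strip().split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    row = {"timestamp": values[0]}
    for name, raw in zip(FIELDS[1:], values[1:]):
        try:
            row[name] = float(raw)
        except ValueError:
            row[name] = None  # e.g. "[N/A]"
    return row


def log_until_interrupted(path="gpu_telemetry.csv", interval_s=1.0):
    """Append one sample per interval, syncing so the last rows survive a hang."""
    with open(path, "a") as log_file:
        while True:
            try:
                result = subprocess.run(
                    build_query_cmd(), capture_output=True, text=True, timeout=10
                )
                log_file.write(result.stdout)
            except subprocess.TimeoutExpired:
                # nvidia-smi itself may stall once the GPU is locked up.
                log_file.write("nvidia-smi timed out\n")
            log_file.flush()
            os.fsync(log_file.fileno())
            time.sleep(interval_s)
```

Running `log_until_interrupted()` in a separate terminal while SGLang is serving leaves a CSV whose final rows show power draw, temperature, and SM clock immediately before the watchdog fires. The `fsync` after every sample is deliberate, since the machine needs a hard reboot after the crash.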

## Related Issues

This appears to be the same GSP firmware halt class as:
- #1111 - RTX PRO 6000 Blackwell + llama.cpp inference (silent hard hang)
- #1080 - RTX 5090 GB202 + Vulkan/LLM load (GSP heartbeat timeout → Xid 109/8)

The firmware reverse-engineering analysis in #1080 confirms that the root cause is a missing GPU reset recovery path for Blackwell in the kernel driver.

## Expected Behavior

The GPU should remain stable under sustained LLM inference workloads, or at minimum have a working recovery path that doesn't require a full system reboot.