docs: align monitoring metric names with HAMi v2.9 exporters #498
ntheanh201 wants to merge 2 commits into
Signed-off-by: The Anh Nguyen <ntheanh201@gmail.com>
@coderabbitai review
Action performed: review finished.
📝 Walkthrough
Monitoring documentation updates replace older metric names and label keys with updated HAMi metric names across the device allocation, real-time device usage, and Grafana dashboard guides, including the versioned v2.9.0 copies.
Changes
Monitoring metric documentation
Estimated code review effort: 2 (Simple) | ~10 minutes
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/userguide/monitoring/device-allocation.md`:
- Around line 16-23: The example label set for hami_gpu_memory_allocated_bytes
uses device_cores, which is inconsistent with the other device labels and
appears to be a typo. Update the example in device-allocation.md to use the
correct label key consistent with the rest of the table, and keep the rest of
the hami_gpu_memory_allocated_bytes metric example unchanged. Make sure the
corrected label matches the naming used by the other device allocation metrics
and examples.
In `@versioned_docs/version-v2.9.0/userguide/monitoring/device-allocation.md`:
- Around line 16-23: The example label key in the device allocation metrics
table is inconsistent: the hami_gpu_memory_allocated_bytes row uses device_cores
while the surrounding metrics use device_index and device_uuid. Update that row
in device-allocation.md to use the correct label key so the example matches the
rest of the table and the metric naming pattern.
📒 Files selected for processing (6)
- docs/userguide/monitoring/device-allocation.md
- docs/userguide/monitoring/grafana-dashboard.md
- docs/userguide/monitoring/real-time-device-usage.md
- versioned_docs/version-v2.9.0/userguide/monitoring/device-allocation.md
- versioned_docs/version-v2.9.0/userguide/monitoring/grafana-dashboard.md
- versioned_docs/version-v2.9.0/userguide/monitoring/real-time-device-usage.md
| Metric | Description | Example |
| --- | --- | --- |
| hami_gpu_core_limit_ratio | Device core limit for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 100 |
| hami_gpu_memory_limit_bytes | Device memory limit for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 3.4359738368e+10 |
| hami_gpu_core_allocated_ratio | Device core allocated for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 45 |
| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_cores="0",device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
| hami_gpu_shared_count | Number of containers sharing this GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 1 |
| hami_vgpu_core_allocated_ratio | vGPU core allocated from a container | `{container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"}` 50 |
| hami_vgpu_memory_allocated_bytes | vGPU memory allocated from a container | `{container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"}` 3.221225472e+09 |
| hami_resource_quota_used | resourcequota usage for a certain device | `{quota_name="nvidia.com/gpucores", namespace="default",limit="200",zone="vGPU"}` 100 |
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Fix the example label key.
device_cores is inconsistent with the rest of the table and looks like a typo. That makes the example query misleading.
Proposed fix
-| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_cores="0",device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
+| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
📝 Committable suggestion
| Metric | Description | Example |
| --- | --- | --- |
| hami_gpu_core_limit_ratio | Device core limit for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 100 |
| hami_gpu_memory_limit_bytes | Device memory limit for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 3.4359738368e+10 |
| hami_gpu_core_allocated_ratio | Device core allocated for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 45 |
| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
| hami_gpu_shared_count | Number of containers sharing this GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 1 |
| hami_vgpu_core_allocated_ratio | vGPU core allocated from a container | `{container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"}` 50 |
| hami_vgpu_memory_allocated_bytes | vGPU memory allocated from a container | `{container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"}` 3.221225472e+09 |
| hami_resource_quota_used | resourcequota usage for a certain device | `{quota_name="nvidia.com/gpucores", namespace="default",limit="200",zone="vGPU"}` 100 |
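
The kind of inconsistency flagged in this comment can be caught mechanically. The following is a minimal sketch, not part of the PR: it assumes the device-level rows should carry only the four label keys seen in the neighbouring examples (`device_index`, `device_uuid`, `node`, `zone`), and the helper names `label_keys` and `unexpected_keys` are hypothetical. It only compares an example against that assumed set; it cannot tell whether `device_cores` is actually emitted by a live exporter.

```python
import re

# Assumption: label keys taken from the other device-level rows in the table.
DEVICE_LABELS = {"device_index", "device_uuid", "node", "zone"}


def label_keys(example):
    """Return the label keys found in a `{key="value",...}` example string."""
    return re.findall(r'([A-Za-z_][A-Za-z0-9_]*)\s*=\s*"', example)


def unexpected_keys(example, allowed=DEVICE_LABELS):
    """Return the keys in `example` that are not in the allowed set."""
    return [key for key in label_keys(example) if key not in allowed]


before = ('{device_cores="0",device_index="0",'
          'device_uuid="aio-node74-arm-Ascend310P-0",'
          'node="aio-node74-arm",zone="vGPU"}')
after = ('{device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",'
         'node="aio-node74-arm",zone="vGPU"}')

print(unexpected_keys(before))
print(unexpected_keys(after))
```

Run against the two versions of the `hami_gpu_memory_allocated_bytes` row, the first call reports the extra key and the second reports none.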
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/userguide/monitoring/device-allocation.md` around lines 16 - 23, The
example label set for hami_gpu_memory_allocated_bytes uses device_cores, which
is inconsistent with the other device labels and appears to be a typo. Update
the example in device-allocation.md to use the correct label key consistent with
the rest of the table, and keep the rest of the hami_gpu_memory_allocated_bytes
metric example unchanged. Make sure the corrected label matches the naming used
by the other device allocation metrics and examples.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Fix the example label key.
device_cores is inconsistent with the rest of the table and looks like a typo. That makes the example query misleading.
Proposed fix
-| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_cores="0",device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
+| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@versioned_docs/version-v2.9.0/userguide/monitoring/device-allocation.md`
around lines 16 - 23, The example label key in the device allocation metrics
table is inconsistent: the hami_gpu_memory_allocated_bytes row uses device_cores
while the surrounding metrics use device_index and device_uuid. Update that row
in device-allocation.md to use the correct label key so the example matches the
rest of the table and the metric naming pattern.
What
The monitoring docs still list the pre-v2.9 CamelCase metric names and old label
keys. On a running v2.9 cluster, the exporters emit Prometheus-style `hami_*`
snake_case names with renamed labels, and the device-plugin endpoint no longer
exposes the per-container series. This PR updates the docs to match the actual output.
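
For anyone with existing queries or dashboard expressions, the rename can be applied as a lookup. This is a rough sketch rather than anything shipped with HAMi: the old and new names are copied from the metric name mapping in this PR, while `METRIC_RENAMES` and `rename_metrics` are hypothetical names. It rewrites metric names only; label keys are left alone because the old label keys are not listed here.

```python
import re

# Old (pre-v2.9) name -> v2.9 name, copied from the mapping in this PR.
METRIC_RENAMES = {
    "GPUDeviceCoreLimit": "hami_gpu_core_limit_ratio",
    "GPUDeviceMemoryLimit": "hami_gpu_memory_limit_bytes",
    "GPUDeviceCoreAllocated": "hami_gpu_core_allocated_ratio",
    "GPUDeviceMemoryAllocated": "hami_gpu_memory_allocated_bytes",
    "GPUDeviceSharedNum": "hami_gpu_shared_count",
    "vGPUCoreAllocated": "hami_vgpu_core_allocated_ratio",
    "vGPUMemoryAllocated": "hami_vgpu_memory_allocated_bytes",
    "QuotaUsed": "hami_resource_quota_used",
    "HostCoreUtilization": "hami_host_gpu_utilization_ratio",
    "HostGPUMemoryUsage": "hami_host_gpu_memory_used_bytes",
}

# Match whole identifiers only, longest names first, so a name embedded in a
# longer identifier is not rewritten.
_OLD_NAME = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, METRIC_RENAMES), key=len, reverse=True))
    + r")\b"
)


def rename_metrics(expr):
    """Rewrite old metric names in a query expression to their v2.9 names."""
    return _OLD_NAME.sub(lambda m: METRIC_RENAMES[m.group(1)], expr)


print(rename_metrics('sum(GPUDeviceMemoryAllocated{zone="vGPU"}) by (node)'))
```

Any label matchers in a rewritten expression still need a manual check against the updated tables.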
Changes
- `userguide/monitoring/device-allocation.md`, scheduler endpoint (`:31993`): rename the 8 metrics to their `hami_*` names and fix the label keys (`device_index` / `device_uuid` / `node` / `container_index` / `namespace` / `pod` / `quota_name`).
- `userguide/monitoring/real-time-device-usage.md`, device-plugin endpoint (`:31992`): v2.9 only exposes `hami_host_gpu_utilization_ratio` and `hami_host_gpu_memory_used_bytes`; renamed those and removed the four per-container rows that are no longer emitted.
- `userguide/monitoring/grafana-dashboard.md`: same `:31992` correction.
- Same changes mirrored under `versioned_docs/version-v2.9.0/`.
Metric name mapping:
| Old name | New name |
| --- | --- |
| GPUDeviceCoreLimit | hami_gpu_core_limit_ratio |
| GPUDeviceMemoryLimit | hami_gpu_memory_limit_bytes |
| GPUDeviceCoreAllocated | hami_gpu_core_allocated_ratio |
| GPUDeviceMemoryAllocated | hami_gpu_memory_allocated_bytes |
| GPUDeviceSharedNum | hami_gpu_shared_count |
| vGPUCoreAllocated | hami_vgpu_core_allocated_ratio |
| vGPUMemoryAllocated | hami_vgpu_memory_allocated_bytes |
| QuotaUsed | hami_resource_quota_used |
| HostCoreUtilization | hami_host_gpu_utilization_ratio |
| HostGPUMemoryUsage | hami_host_gpu_memory_used_bytes |
How verified
Curled both endpoints on a v2.9.0 cluster and used the live output for the names,
labels, and example values:
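
The captured output is not reproduced here. A comparable check could be scripted roughly as follows. This is a sketch under assumptions: the `/metrics` path and the `<node-ip>` placeholder are not stated in this PR, and `metric_names`, `fetch_metric_names`, and `missing_metrics` are hypothetical helper names. It pulls the metric names out of Prometheus text-format output and reports which expected names are absent.

```python
import re
from urllib.request import urlopen

# A sample line starts with the metric name, followed by "{" or whitespace.
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{|\s)")

# Names this PR says the device-plugin endpoint (:31992) exposes in v2.9.
EXPECTED_HOST_METRICS = {
    "hami_host_gpu_utilization_ratio",
    "hami_host_gpu_memory_used_bytes",
}


def metric_names(exposition_text):
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE
            continue
        match = _SAMPLE.match(line)
        if match:
            names.add(match.group(1))
    return names


def fetch_metric_names(url, timeout=5.0):
    """Scrape `url` and return the metric names it exposes."""
    with urlopen(url, timeout=timeout) as response:
        return metric_names(response.read().decode("utf-8"))


def missing_metrics(names, expected=EXPECTED_HOST_METRICS):
    """Return the expected names that were not found, sorted."""
    return sorted(expected - set(names))


# Example call (assumed URL shape, not taken from this PR):
#   missing_metrics(fetch_metric_names("http://<node-ip>:31992/metrics"))
```

An empty result from `missing_metrics` would mean both host metrics named in this PR were present in the scrape.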
Summary by CodeRabbit
hami_-prefixed host and device metrics.