docs: align monitoring metric names with HAMi v2.9 exporters #498

Open
ntheanh201 wants to merge 2 commits into Project-HAMi:master from ntheanh201:docs/metrics-v2.9

@ntheanh201 ntheanh201 commented Jun 26, 2026

Contributor

What

The monitoring docs still list the pre-v2.9 metric names (CamelCase) and old label
keys. On a running v2.9 cluster the exporters emit Prometheus-style hami_*
snake_case names with renamed labels, and the device-plugin endpoint no longer
exposes the per-container series. This updates the docs to match the actual output.

Changes

  • userguide/monitoring/device-allocation.md — scheduler endpoint (:31993):
    rename the 8 metrics to their hami_* names and fix the label keys
    (device_index/device_uuid/node/container_index/namespace/pod/quota_name).
  • userguide/monitoring/real-time-device-usage.md — device-plugin endpoint (:31992):
    v2.9 only exposes hami_host_gpu_utilization_ratio and
    hami_host_gpu_memory_used_bytes; renamed those and removed the four
    per-container rows that are no longer emitted.
  • userguide/monitoring/grafana-dashboard.md — same :31992 correction.
  • Mirrored the same fixes into versioned_docs/version-v2.9.0/.

Metric name mapping:

| old | v2.9 |
| --- | --- |
| GPUDeviceCoreLimit | hami_gpu_core_limit_ratio |
| GPUDeviceMemoryLimit | hami_gpu_memory_limit_bytes |
| GPUDeviceCoreAllocated | hami_gpu_core_allocated_ratio |
| GPUDeviceMemoryAllocated | hami_gpu_memory_allocated_bytes |
| GPUDeviceSharedNum | hami_gpu_shared_count |
| vGPUCoreAllocated | hami_vgpu_core_allocated_ratio |
| vGPUMemoryAllocated | hami_vgpu_memory_allocated_bytes |
| QuotaUsed | hami_resource_quota_used |
| HostCoreUtilization | hami_host_gpu_utilization_ratio |
| HostGPUMemoryUsage | hami_host_gpu_memory_used_bytes |
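
For anyone carrying existing dashboards or alert rules across the rename, the mapping above can be applied mechanically. The following is only a rough sketch: `METRIC_RENAMES` and `rename_metrics` are hypothetical names introduced here, not part of HAMi, and the helper rewrites metric names only. Label keys are left untouched because the old label keys are not listed in this description.

```python
import re

# Old CamelCase name -> v2.9 name, copied from the mapping in this PR description.
METRIC_RENAMES = {
    "GPUDeviceCoreLimit": "hami_gpu_core_limit_ratio",
    "GPUDeviceMemoryLimit": "hami_gpu_memory_limit_bytes",
    "GPUDeviceCoreAllocated": "hami_gpu_core_allocated_ratio",
    "GPUDeviceMemoryAllocated": "hami_gpu_memory_allocated_bytes",
    "GPUDeviceSharedNum": "hami_gpu_shared_count",
    "vGPUCoreAllocated": "hami_vgpu_core_allocated_ratio",
    "vGPUMemoryAllocated": "hami_vgpu_memory_allocated_bytes",
    "QuotaUsed": "hami_resource_quota_used",
    "HostCoreUtilization": "hami_host_gpu_utilization_ratio",
    "HostGPUMemoryUsage": "hami_host_gpu_memory_used_bytes",
}

# Match whole identifiers only, longest names first, so that a name embedded
# in a longer identifier (e.g. "MyQuotaUsed") is not rewritten.
_OLD_NAME = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, METRIC_RENAMES), key=len, reverse=True))
    + r")\b"
)


def rename_metrics(expr: str) -> str:
    """Return expr with every old metric name replaced by its v2.9 name."""
    return _OLD_NAME.sub(lambda m: METRIC_RENAMES[m.group(1)], expr)


if __name__ == "__main__":
    print(rename_metrics("sum(GPUDeviceMemoryAllocated) / sum(GPUDeviceMemoryLimit)"))
```

The substitution is purely textual, so a quoted label value that happens to equal an old metric name would also be rewritten; review the output before committing it.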

How verified

Curled both endpoints on a v2.9.0 cluster and used the live output for the names,
labels, and example values:

curl <scheduler-ip>:31993/metrics   # hami_gpu_*, hami_vgpu_*, hami_resource_quota_used, ...
curl <gpu-node-ip>:31992/metrics    # hami_host_gpu_utilization_ratio, hami_host_gpu_memory_used_bytes
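
The same check can be repeated without eyeballing the curl output. This is a minimal sketch, assuming the output has been saved to a string; `EXPECTED`, `metric_names`, and `missing_metrics` are hypothetical helpers introduced for illustration, and the parser handles only the simple `name{labels} value` lines of the Prometheus text format.

```python
# Metric names each endpoint is expected to expose, per this PR description.
EXPECTED = {
    "scheduler": {  # :31993
        "hami_gpu_core_limit_ratio",
        "hami_gpu_memory_limit_bytes",
        "hami_gpu_core_allocated_ratio",
        "hami_gpu_memory_allocated_bytes",
        "hami_gpu_shared_count",
        "hami_vgpu_core_allocated_ratio",
        "hami_vgpu_memory_allocated_bytes",
        "hami_resource_quota_used",
    },
    "device-plugin": {  # :31992
        "hami_host_gpu_utilization_ratio",
        "hami_host_gpu_memory_used_bytes",
    },
}


def metric_names(exposition: str) -> set:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first "{" (labels) or whitespace (value).
        parts = line.split("{", 1)[0].split()
        if parts:
            names.add(parts[0])
    return names


def missing_metrics(endpoint: str, exposition: str) -> set:
    """Return expected names for the endpoint that are absent from the output."""
    return EXPECTED[endpoint] - metric_names(exposition)
```

Feeding it the saved output of each curl command and getting an empty set back would indicate that every documented name is present; it does not verify label keys or values.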

Summary by CodeRabbit

  • Documentation
    • Updated monitoring guides to reflect the latest metric names and label keys.
    • Replaced older GPU, vGPU, and quota metric references with new hami_-prefixed host and device metrics.
    • Revised Grafana and real-time usage examples to show host-level GPU utilization and memory usage metrics.
    • Aligned the versioned docs with the current Prometheus metric schema for device allocation and usage.

Signed-off-by: The Anh Nguyen <ntheanh201@gmail.com>

hami-robot Bot commented Jun 26, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ntheanh201
Once this PR has been reviewed and has the lgtm label, please assign windsonsea for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot Bot requested review from wawa0210 and windsonsea June 26, 2026 02:35
@netlify

netlify Bot commented Jun 26, 2026

Deploy Preview for project-hami ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 098b7dc |
| 🔍 Latest deploy log | https://app.netlify.com/projects/project-hami/deploys/6a3de7915bbac80008e08ea7 |
| 😎 Deploy Preview | https://deploy-preview-498--project-hami.netlify.app |

@hami-robot hami-robot Bot added the size/M label Jun 26, 2026
Signed-off-by: The Anh Nguyen <ntheanh201@gmail.com>
@rootsongjc

Contributor

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

📝 Walkthrough

Monitoring documentation updates replace older metric names and label keys with updated HAMi metric names across the device allocation, real-time device usage, and Grafana dashboard guides, including the versioned v2.9.0 copies.

Changes

Monitoring metric documentation

| Layer / File(s) | Summary |
| --- | --- |
| Device allocation metrics: docs/userguide/monitoring/device-allocation.md, versioned_docs/version-v2.9.0/userguide/monitoring/device-allocation.md | The cluster device allocation metric tables replace older GPUDevice, vGPU, and QuotaUsed entries and label keys with updated HAMi metric names and label schemas. |
| Real-time usage metrics: docs/userguide/monitoring/real-time-device-usage.md, versioned_docs/version-v2.9.0/userguide/monitoring/real-time-device-usage.md | The real-time device usage metric tables now list host-level GPU utilization and memory metrics with updated label keys. |
| Grafana key metrics: docs/userguide/monitoring/grafana-dashboard.md, versioned_docs/version-v2.9.0/userguide/monitoring/grafana-dashboard.md | The Grafana dashboard key metrics tables now list host-level GPU utilization and memory usage metrics instead of the prior container-focused entries. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

Hop hop, I read the metric parade,
With hami_ carrots freshly displayed. 🐰
The labels now twinkle, neat and clear,
In docs both new and versioned here.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main documentation change: aligning monitoring metric names with HAMi v2.9 exporters. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/userguide/monitoring/device-allocation.md`:
- Around line 16-23: The example label set for hami_gpu_memory_allocated_bytes
uses device_cores, which is inconsistent with the other device labels and
appears to be a typo. Update the example in device-allocation.md to use the
correct label key consistent with the rest of the table, and keep the rest of
the hami_gpu_memory_allocated_bytes metric example unchanged. Make sure the
corrected label matches the naming used by the other device allocation metrics
and examples.

In `@versioned_docs/version-v2.9.0/userguide/monitoring/device-allocation.md`:
- Around line 16-23: The example label key in the device allocation metrics
table is inconsistent: the hami_gpu_memory_allocated_bytes row uses device_cores
while the surrounding metrics use device_index and device_uuid. Update that row
in device-allocation.md to use the correct label key so the example matches the
rest of the table and the metric naming pattern.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: e7d7e6e9-35ea-49e5-bdce-408b58be5098

📥 Commits

Reviewing files that changed from the base of the PR and between 303f9ca and 098b7dc.

📒 Files selected for processing (6)
  • docs/userguide/monitoring/device-allocation.md
  • docs/userguide/monitoring/grafana-dashboard.md
  • docs/userguide/monitoring/real-time-device-usage.md
  • versioned_docs/version-v2.9.0/userguide/monitoring/device-allocation.md
  • versioned_docs/version-v2.9.0/userguide/monitoring/grafana-dashboard.md
  • versioned_docs/version-v2.9.0/userguide/monitoring/real-time-device-usage.md

Comment on lines +16 to +23
| hami_gpu_core_limit_ratio | Device core limit for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 100 |
| hami_gpu_memory_limit_bytes | Device memory limit for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 3.4359738368e+10 |
| hami_gpu_core_allocated_ratio | Device core allocated for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 45 |
| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_cores="0",device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
| hami_gpu_shared_count | Number of containers sharing this GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 1 |
| hami_vgpu_core_allocated_ratio | vGPU core allocated from a container | `{container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"}` 50 |
| hami_vgpu_memory_allocated_bytes | vGPU memory allocated from a container | `{container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"}` 3.221225472e+09 |
| hami_resource_quota_used | resourcequota usage for a certain device | `{quota_name="nvidia.com/gpucores", namespace="default",limit="200",zone="vGPU"}` 100 |

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Fix the example label key.

device_cores is inconsistent with the rest of the table and looks like a typo. That makes the example query misleading.

Proposed fix
-| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_cores="0",device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
+| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| hami_gpu_core_limit_ratio | Device core limit for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 100 |
| hami_gpu_memory_limit_bytes | Device memory limit for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 3.4359738368e+10 |
| hami_gpu_core_allocated_ratio | Device core allocated for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 45 |
| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
| hami_gpu_shared_count | Number of containers sharing this GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 1 |
| hami_vgpu_core_allocated_ratio | vGPU core allocated from a container | `{container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"}` 50 |
| hami_vgpu_memory_allocated_bytes | vGPU memory allocated from a container | `{container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"}` 3.221225472e+09 |
| hami_resource_quota_used | resourcequota usage for a certain device | `{quota_name="nvidia.com/gpucores", namespace="default",limit="200",zone="vGPU"}` 100 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/userguide/monitoring/device-allocation.md` around lines 16 - 23, The
example label set for hami_gpu_memory_allocated_bytes uses device_cores, which
is inconsistent with the other device labels and appears to be a typo. Update
the example in device-allocation.md to use the correct label key consistent with
the rest of the table, and keep the rest of the hami_gpu_memory_allocated_bytes
metric example unchanged. Make sure the corrected label matches the naming used
by the other device allocation metrics and examples.
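
A small lint can catch this kind of stray label key in example rows before review. The sketch below is illustrative only: `ALLOWED_LABEL_KEYS` and `unexpected_label_keys` are hypothetical names, and the allowed set is an assumption built from the keys listed in the PR description plus `zone` and `limit`, which appear in the example rows. It flags keys outside that set; it cannot tell whether the exporter really emits `device_cores`.

```python
import re

# Keys named in the PR description, plus two seen in the example rows.
ALLOWED_LABEL_KEYS = {
    "device_index", "device_uuid", "node", "container_index",
    "namespace", "pod", "quota_name",
    "zone", "limit",  # assumption: observed in the examples, not in the listed keys
}

# One key="value" pair; values containing escaped quotes are not handled.
_LABEL = re.compile(r'(\w+)="[^"]*"')


def unexpected_label_keys(sample: str) -> list:
    """Return label keys in a {k="v",...} sample that are not in the allowed set."""
    start, end = sample.find("{"), sample.rfind("}")
    if start == -1 or end < start:
        return []  # no label set in this sample
    keys = _LABEL.findall(sample[start + 1:end])
    return sorted(set(keys) - ALLOWED_LABEL_KEYS)


if __name__ == "__main__":
    row = ('{device_cores="0",device_index="0",'
           'device_uuid="aio-node74-arm-Ascend310P-0",'
           'node="aio-node74-arm",zone="vGPU"}')
    print(unexpected_label_keys(row))
```

Run against the flagged row it reports `device_cores`; against the proposed replacement row it reports nothing.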

Comment on lines +16 to +23
| hami_gpu_core_limit_ratio | Device core limit for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 100 |
| hami_gpu_memory_limit_bytes | Device memory limit for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 3.4359738368e+10 |
| hami_gpu_core_allocated_ratio | Device core allocated for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 45 |
| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_cores="0",device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
| hami_gpu_shared_count | Number of containers sharing this GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 1 |
| hami_vgpu_core_allocated_ratio | vGPU core allocated from a container | `{container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"}` 50 |
| hami_vgpu_memory_allocated_bytes | vGPU memory allocated from a container | `{container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"}` 3.221225472e+09 |
| hami_resource_quota_used | resourcequota usage for a certain device | `{quota_name="nvidia.com/gpucores", namespace="default",limit="200",zone="vGPU"}` 100 |

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Fix the example label key.

device_cores is inconsistent with the rest of the table and looks like a typo. That makes the example query misleading.

Proposed fix
-| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_cores="0",device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
+| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| hami_gpu_core_limit_ratio | Device core limit for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 100 |
| hami_gpu_memory_limit_bytes | Device memory limit for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 3.4359738368e+10 |
| hami_gpu_core_allocated_ratio | Device core allocated for a certain GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 45 |
| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | `{device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"}` 3.221225472e+09 |
| hami_gpu_shared_count | Number of containers sharing this GPU | `{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"}` 1 |
| hami_vgpu_core_allocated_ratio | vGPU core allocated from a container | `{container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"}` 50 |
| hami_vgpu_memory_allocated_bytes | vGPU memory allocated from a container | `{container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"}` 3.221225472e+09 |
| hami_resource_quota_used | resourcequota usage for a certain device | `{quota_name="nvidia.com/gpucores", namespace="default",limit="200",zone="vGPU"}` 100 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@versioned_docs/version-v2.9.0/userguide/monitoring/device-allocation.md`
around lines 16 - 23, The example label key in the device allocation metrics
table is inconsistent: the hami_gpu_memory_allocated_bytes row uses device_cores
while the surrounding metrics use device_index and device_uuid. Update that row
in device-allocation.md to use the correct label key so the example matches the
rest of the table and the metric naming pattern.
