HDDS-14825. Add Grafana Dashboard and Metrics for ZDU #10602

Draft
errose28 wants to merge 31 commits into apache:HDDS-14496-zdu from errose28:worktree-version-metrics
@errose28 errose28 commented Jun 24, 2026

What changes were proposed in this pull request?

Changes generated by Claude Code with a spec, reviews, and edits by me.

Metrics

  • Added build info: revision (git hash/string), component (string), and version (string)
    • This information was already present in every build, but was not exposed as metrics.
    • To expose the strings, they are added as labels to a gauge with a constant value of 1. Thanks @octachoron for the tip.
  • Added S3 gateway client version
    • S3 Gateway does not use software or apparent version since it is stateless. It does contain an Ozone client, so the client version metric is exposed instead.
  • Added a gauge to track the presence of the OM DB finalizing marker (1 if present, 0 otherwise), which can be used to check if a finalization is in progress.
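
The constant-value gauge described above is a common way to surface string-valued build info through a numeric metrics system. The following is a minimal, standalone sketch of what one exported sample could look like in the Prometheus text format. The class `BuildInfoSample`, the metric name `build_info`, and the label values are hypothetical and are not taken from this PR; the real names depend on how the Hadoop metrics record is translated by the Prometheus sink.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper, not part of Ozone: renders an "info"-style metric sample. */
final class BuildInfoSample {
  private BuildInfoSample() { }

  /** Escapes backslash, double quote, and newline in a label value. */
  static String escape(String value) {
    return value
        .replace("\\", "\\\\")
        .replace("\"", "\\\"")
        .replace("\n", "\\n");
  }

  /** Renders one sample line: the labels carry the strings, the value is always 1. */
  static String render(String metricName, Map<String, String> labels) {
    String body = labels.entrySet().stream()
        .map(e -> e.getKey() + "=\"" + escape(e.getValue()) + "\"")
        .collect(Collectors.joining(","));
    return metricName + "{" + body + "} 1";
  }

  public static void main(String[] args) {
    // Illustrative values only; LinkedHashMap keeps the label order stable.
    Map<String, String> labels = new LinkedHashMap<>();
    labels.put("component", "om");
    labels.put("revision", "abc1234");
    labels.put("version", "2.1.0");
    // prints: build_info{component="om",revision="abc1234",version="2.1.0"} 1
    System.out.println(render("build_info", labels));
  }
}
```

Because the value is always 1, a query can count or group series by label (for example, the number of nodes per `version`) without parsing the strings.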

Software and apparent version metrics for all relevant components were already published.

This PR removes one @Metric annotation as a workaround for #10523, which is still pending merge on master. We should be able to merge either PR independently and reconcile them once both have landed.

Dashboard

A Grafana dashboard was added to assist admins as they orchestrate the upgrade. Since it depends on the new metrics, it will only be usable when upgrading from the initial version that supports ZDU (just like the ZDU feature itself). However, once the metrics are present, it could also be helpful for a non-rolling upgrade.

Since this dashboard was designed with admins in mind, it does not expose software version, apparent version, or client versions, which are internal to the cluster. It only exposes admin-facing properties like "finalized" as a boolean state and the build version string. Internal version info is still accessible with PromQL for more dev-focused debugging as needed.

All panels were designed to account for large clusters, so the dashboard remains readable even when there are 1000+ nodes. The tables are paginated and all other values are aggregates. The selectors at the top of the dashboard support drilling down to specific components as needed, and a banner at the top indicates when a filtered view is active.

[Screenshots of the dashboard, 2026-06-24]

What is the link to the Apache JIRA

HDDS-14825

How was this patch tested?

Unit tests for the new metrics were added.

The dashboard can be manually viewed from Grafana in a local docker environment:

cd hadoop-ozone/dist/target/ozone-*/compose/ozone
COMPOSE_FILE=docker-compose.yaml:monitoring.yaml docker compose up --scale datanode=3 -d
# Go to http://localhost:3000/dashboards and select "Ozone - Rolling Upgrade"
# To tear down:
COMPOSE_FILE=docker-compose.yaml:monitoring.yaml docker compose down

The dashboard will need a few seconds to populate its values. Also narrow the time range to the last few minutes, since the default 30 minute window will be hard to read when the cluster has only been live for a few seconds.

By default, this will run with all nodes finalized and on the same version. To see the dashboard with a simulated in-progress upgrade, build Ozone with the following patch applied. Note that these injected values are only meant to demonstrate a range of possibilities on the dashboard at once. They do not reflect a realistic state for the cluster during an upgrade.

diff --git b/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/BuildInfoMetrics.java a/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/BuildInfoMetrics.java
index f8ce4cbdc0..68aace6544 100644
--- b/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/BuildInfoMetrics.java
+++ a/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/BuildInfoMetrics.java
@@ -74,9 +74,18 @@ public static synchronized BuildInfoMetrics create(String component) {
   public void getMetrics(MetricsCollector collector, boolean all) {
     MetricsRecordBuilder builder = collector.addRecord(RECORD_NAME)
         .add(new MetricsTag(
-            Interns.info("component", "Ozone component name"), component)).add(new MetricsTag(Interns.info("revision", "Source control revision"), revision))
-        .add(new MetricsTag(Interns.info("version", "Ozone build version"), version))
+            Interns.info("component", "Ozone component name"), component))
+        .add(new MetricsTag(
+            Interns.info("revision", "Source control revision"), revision))
         .addGauge(Interns.info("BuildInfo", "Always 1; identifying info is in labels"), 1L);
+
+    if (component.equals("hddsDatanode")) {
+      builder.add(new MetricsTag(Interns.info("version", "Ozone build version"), "2.1.0-TEST"));
+    } else {
+      builder.add(new MetricsTag(Interns.info("version", "Ozone build version"), version));
+    }
+
+
     builder.endRecord();
   }
 }
diff --git b/hadoop-ozone/dist/src/main/compose/ozone/docker-config a/hadoop-ozone/dist/src/main/compose/ozone/docker-config
index ecca3a971c..0c16691f2d 100644
--- b/hadoop-ozone/dist/src/main/compose/ozone/docker-config
+++ a/hadoop-ozone/dist/src/main/compose/ozone/docker-config
@@ -67,3 +67,8 @@ no_proxy=om,scm,s3g,recon,kdc,localhost,127.0.0.1
 
 # Explicitly enable filesystem snapshot feature for this Docker compose cluster
 OZONE-SITE.XML_ozone.filesystem.snapshot.enabled=true
+
+# Testing overrides for ZDU dashboard verification: start with apparent < software
+# to demonstrate divergence rendering. Revert before running acceptance tests.
+OZONE-SITE.XML_testing.ozone.om.init.apparent.version=7
+OZONE-SITE.XML_testing.hdds.scm.init.apparent.version=8

@errose28 errose28 requested review from dombizita and sodonnel June 24, 2026 23:03
@errose28 errose28 added the zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496 label Jun 24, 2026
@errose28 errose28 marked this pull request as draft June 24, 2026 23:09
@errose28 commented:

The cluster finalization status stat panel in this dashboard currently has an issue because it only checks whether the software and apparent versions of each component match. This will give incorrect results if a release introduces no component version change for a component. I'm looking at fixing this the same way we did for OM itself, by publishing a metric based on the DB marker which indicates whether a finalize command was actually issued.

errose28 added 4 commits June 25, 2026 17:48
* HDDS-14496-zdu:
  HDDS-15622. New finalize command should check OM server version (apache#10548)
  HDDS-15609. Legacy SCM Finalize command should become a no-op (apache#10543)
  HDDS-15528. Adjust upgrade finalize command to call OM instread of SCM (apache#10493)
  HDDS-15488. Recon upgrade actions should be idempotent (apache#10442)
  HDDS-15482. Add fencing based on datanode versions to SCM and Recon (apache#10504)
  HDDS-15374. Switch Recon to the new versioning framework (apache#10443)
@errose28 commented:

I updated the status panel at the top of the dashboard to better reflect the state of the ongoing upgrade.

  • Upgrading (orange): There is a mix of build versions in the cluster
  • Unfinalized (yellow): Build versions match but there are components whose apparent version != software version
  • Finalizing (orange): Build versions match and the OM is tracking an ongoing finalization
  • Complete (green): The default state

This better handles the case where an upgrade was done but there were no new component versions introduced. This case will now show as "complete" to indicate that there is no more action required from the admin, instead of "finalized" which would be misleading since no finalize command was ever given and downgrade is still allowed.

This is made possible by a newly added metric that indicates the presence of the OM DB finalizing marker. The mapping of metrics -> number -> stats is done with an AI-generated PromQL expression that is difficult to read but seems to work in practice.
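
For readers who find the PromQL hard to follow, the four states listed above can be sketched as plain Java. This is only an illustration of the decision logic as described in this comment, not the dashboard's actual query. The class, method, and parameter names are hypothetical, and the precedence of "Finalizing" over "Unfinalized" when both conditions hold is an assumption.

```java
/** Hypothetical model of the status panel; names are illustrative, not Ozone APIs. */
final class UpgradeStatusSketch {
  enum Status { UPGRADING, UNFINALIZED, FINALIZING, COMPLETE }

  private UpgradeStatusSketch() { }

  /**
   * @param distinctBuildVersions number of different build version strings in the cluster
   * @param anyVersionMismatch true if some component has apparent version != software version
   * @param omFinalizingMarkerPresent true if the OM DB finalizing marker gauge reports 1
   */
  static Status classify(int distinctBuildVersions,
                         boolean anyVersionMismatch,
                         boolean omFinalizingMarkerPresent) {
    if (distinctBuildVersions > 1) {
      return Status.UPGRADING;   // mixed build versions in the cluster
    }
    if (omFinalizingMarkerPresent) {
      return Status.FINALIZING;  // assumption: checked before the mismatch case
    }
    if (anyVersionMismatch) {
      return Status.UNFINALIZED; // builds match, finalize not yet issued
    }
    return Status.COMPLETE;      // default state, no admin action required
  }

  public static void main(String[] args) {
    // An upgrade that introduced no new component versions shows as COMPLETE.
    System.out.println(classify(1, false, false));
  }
}
```

Checking the states in a fixed order keeps them mutually exclusive, which is what lets a single numeric value drive the panel's text and color.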

Selection of yellow vs orange was somewhat arbitrary. I decided to use orange for the more transient states.

[Screenshot of the updated status panel, 2026-06-25]
