System Metrics#149

Draft
jonasbhend wants to merge 4 commits into
mainfrom
MRB-640-Add-system-metrics-to-dashboard

Conversation

@jonasbhend
Contributor

@jonasbhend jonasbhend commented May 6, 2026

Example dashboard

file:///M:/zue-prod/fc_development/seamless/S-RUC/evaluation/MRB-640_dashboard.html

Context

As an SRUC developer, I want to assess system metrics (GPU usage, memory footprint, …) for the inference jobs of a given experiment. Similar to model metrics, system metrics could be collected for individual timesteps, aggregated, and visualized in a dedicated tab of the dashboard.

Change in rulegraph

[Image: rulegraph]

@frazane
Contributor

frazane commented May 6, 2026

I like the idea of collecting these metrics, but is parsing logs the best way to go? Perhaps a better approach would be to implement the inference program profiling in anemoi-inference directly?

Also, FYI: https://anemoi.readthedocs.io/projects/inference/en/latest/usage/optimisation.html#profiling-and-troubleshooting

@jonasbhend
Contributor Author

jonasbhend commented May 6, 2026

> I like the idea of collecting these metrics, but is parsing logs the best way to go? Perhaps a better approach would be to implement the inference program profiling in anemoi-inference directly?
>
> Also, FYI: https://anemoi.readthedocs.io/projects/inference/en/latest/usage/optimisation.html#profiling-and-troubleshooting

Thanks for the hint, @frazane! But even with inference program profiling, we would still be reading logs to make this information accessible in the dashboard, no?

@frazane
Contributor

frazane commented May 6, 2026

Sorry, I didn't mean we should use what I put in the FYI, but I thought it was relevant.

I was thinking more about something like structured, machine-readable logs or profiling results, possibly using existing tools like py-spy, memory_profiler, the PyTorch profiler, etc.
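To make "structured, machine-readable logs" concrete, here is a minimal sketch using only the Python standard library. It is not how anemoi-inference logs today; the helper names `record_step` and `read_records`, the field names, and the `run_id` value are all hypothetical. Note that `tracemalloc` only sees Python heap allocations, so GPU memory would need a different source.

```python
import io
import json
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def record_step(stream, step, **extra):
    """Write one JSON line with wall time and peak Python heap for a step.

    Hypothetical helper: name and fields are illustrative only.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        record = {
            "step": step,
            "elapsed_s": elapsed,
            "peak_py_bytes": peak,  # Python heap only, not GPU memory
            **extra,
        }
        stream.write(json.dumps(record) + "\n")


def read_records(stream):
    """Parse a JSON-lines stream back into a list of dicts."""
    return [json.loads(line) for line in stream if line.strip()]


# Demo with an in-memory stream standing in for a log file.
log = io.StringIO()
for step in range(3):
    with record_step(log, step, run_id="demo"):
        _ = [0] * 10_000  # placeholder for the real work of one timestep

log.seek(0)
records = read_records(log)
```

Because every line is a self-contained JSON object, the dashboard side could load the file without regular expressions, which is the main advantage over parsing free-form log text.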

@jonasbhend
Contributor Author

> Sorry, I didn't mean we should use what I put in the FYI, but I thought it was relevant.
>
> I was thinking more about something like structured, machine-readable logs or profiling results, possibly using existing tools like py-spy, memory_profiler, the PyTorch profiler, etc.

Sounds like a great alternative. Do we have experience using such tools on our HPC systems? I clearly don't and wouldn't really know where to start. So if this is the direction to take, I suggest someone else picks this up. @dnerini @cosunae thoughts?

@jonasbhend
Contributor Author

Here is an example dashboard (all coded by our dear friend with minimal intervention from my side):

M:/zue-prod/fc_development/seamless/S-RUC/evaluation/MRB-640_dashboard.html

Do you think this is even remotely useful (there is virtually no spread in the results with only three initializations being processed)? Would you expect to see other things (GPU usage, memory use, ...)? Is there a better way than parsing the anemoi logs?

Thanks for your feedback, @dnerini @frazane @MicheleCattaneo @icedoom888

@radiradev

As @frazane is suggesting, it would be nice to have an extensive log using the torch profiler. I think https://github.com/gaogaotiantian/viztracer might be a good option: it seems to be lightweight and comes in an online format, but the default solution is for the dashboard to be self-hosted.

@dnerini
Member

dnerini commented May 6, 2026

Thanks @jonasbhend, this already looks very good! I think what we're looking for in terms of visualization is some sort of distribution view across all individual runs included in an experiment. We did something similar (manually) for the SDL-25 some time ago.
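As a rough illustration of such a distribution view, the per-run values could be reduced to a five-number summary before plotting. This is only a sketch with the standard library: the function name `summarize` and the runtimes below are made up for the example and are not measurements from any experiment.

```python
import statistics


def summarize(values):
    """Return a five-number summary for one metric across runs.

    Requires at least two values (a limit of statistics.quantiles).
    """
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "n": len(ordered),
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }


# Hypothetical wall times (seconds) of five inference runs.
runtimes_s = [118.0, 121.5, 119.2, 130.4, 120.1]
summary = summarize(runtimes_s)
```

A summary like this maps directly onto a box plot per metric, which would show the spread once more initializations are processed.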

Concerning the profiling approach, I agree that parsing the Anemoi logs is perhaps not the best approach. I wonder if we could rely on SLURM instead to collect some basic statistics, which would also decouple it from any specific ML framework.

I could also imagine having something very lightweight and SLURM-based that provides basic statistics for all runs, while something more involved could be used in parallel on only a few test runs to collect more detailed information?
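One possible shape for the lightweight SLURM-based part, as a sketch only: run `sacct --parsable2 --format=JobID,Elapsed,MaxRSS,State -j <jobid>` after the job and parse the pipe-delimited output. The parser below works on a made-up sample string rather than calling `sacct`, the function names are hypothetical, and the memory suffixes are assumed to be powers of 1024.

```python
def parse_elapsed(text):
    """Convert a '[DD-]HH:MM:SS' duration to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Assumption: suffixes are binary multiples.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_mem(text):
    """Convert a value such as '2048K' to bytes; empty fields give None."""
    if not text:
        return None
    if text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(float(text))


def parse_sacct(output):
    """Parse pipe-delimited sacct output (with header) into dicts."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].split("|")
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split("|")))
        rows.append(
            {
                "job_id": row["JobID"],
                "state": row["State"],
                "elapsed_s": parse_elapsed(row["Elapsed"]),
                "max_rss_bytes": parse_mem(row["MaxRSS"]),
            }
        )
    return rows


# Made-up sample, not real accounting data.
SAMPLE = """JobID|Elapsed|MaxRSS|State
12345|00:02:05||COMPLETED
12345.batch|00:02:05|2048K|COMPLETED
12345.0|1-00:00:01|1.5G|COMPLETED
"""

rows = parse_sacct(SAMPLE)
```

This only covers what the scheduler accounts for by default (elapsed time, host memory); GPU utilization would likely still need the more involved profiling on a few test runs.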

4 participants