Adds a CaloClusterGNN/ subdirectory containing the full training
pipeline for the GNN calorimeter-clustering algorithm intended as a
parallel to the existing seed+BFS CaloClusterMaker in Mu2e Offline.
The deployed recipe is "CCN+BFS10": CaloClusterNet edge classifier +
BFS-style traversal with ExpandCut = 10 MeV.
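For orientation, here is a minimal sketch of how an edge-classifier + BFS-with-expand-cut recipe like CCN+BFS10 can be wired together. Everything in it is illustrative: `edge_index`, `edge_prob`, `hit_energy`, and the 0.5 threshold stand in for the real `cluster_reco.py` API and the frozen recipe values.

```python
from collections import deque

TAU_EDGE = 0.5          # illustrative edge-probability threshold (the real value is frozen per model)
EXPAND_CUT_MEV = 10.0   # "BFS10": only expand the frontier through hits above 10 MeV

def reconstruct_clusters(edge_index, edge_prob, hit_energy):
    """Sketch: keep edges the classifier accepts, then BFS over the
    surviving graph, expanding only through hits above EXPAND_CUT_MEV."""
    n_hits = len(hit_energy)
    adj = {i: [] for i in range(n_hits)}
    for (i, j), p in zip(edge_index, edge_prob):
        if p >= TAU_EDGE:                 # edge-classifier decision
            adj[i].append(j)
            adj[j].append(i)

    seen, clusters = set(), []
    for seed in range(n_hits):
        if seed in seen:
            continue
        cluster, frontier = [], deque([seed])
        seen.add(seed)
        while frontier:
            h = frontier.popleft()
            cluster.append(h)
            if hit_energy[h] < EXPAND_CUT_MEV:
                continue                  # absorb the hit, don't traverse through it
            for nb in adj[h]:
                if nb not in seen:
                    seen.add(nb)
                    frontier.append(nb)
        clusters.append(cluster)
    return clusters
```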
Layout (modelled on TrkQual/, but Python-package shaped):
CaloClusterGNN/
README.md how to retrain, frozen hyperparams,
deployment cross-link
setup_env.sh wraps setupmu2e-art.sh + ana 2.6.1
src/
data/ graph builder, calo-entrant truth labels,
normalisation, packed dataset
geometry/ crystalId -> (x, y, disk) loader (sketched after this tree)
models/ SimpleEdgeNet, CaloClusterNet, layers, heads
training/ losses, metrics, trainer
inference/ cluster reconstruction (cluster_reco.py),
postprocess (kept here so train-time eval
scripts work end to end)
scripts/ build/pack/train/tune/evaluate pipeline,
failure audits, cluster-physics eval, ancestry
validation, run1B no-field eval, plotting
configs/ five YAML configs (one per training run)
tests/ 88 unit tests covering all of src/ above
splits/ frozen 35/7/8 v2 split file lists
data/ crystal_geometry.csv + crystal_neighbors.csv +
crystal_map_raw.csv (small lookup tables)
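The geometry/ loader in the layout above amounts to a small table read. A sketch, assuming data/crystal_geometry.csv has columns named crystal_id, x, y, disk (check the actual header in the repo):

```python
import csv

def load_crystal_geometry(path="data/crystal_geometry.csv"):
    """Map crystalId -> (x, y, disk); the column names are assumptions."""
    geom = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            geom[int(row["crystal_id"])] = (
                float(row["x"]), float(row["y"]), int(row["disk"]))
    return geom
```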
What does NOT live here:
* The deployment-side ONNX export / parity scripts (export_onnx.py,
export_norm_stats.py, validate_onnx.py, dump_parity_payloads.py,
compare_parity_dump.py) and the deploy wrappers
(calo_cluster_net_deploy.py, simple_edge_net_deploy.py) -- those
belong with the Mu2e/Offline integration PR, not the training repo.
* The `.onnx` artifacts themselves (shipped via Mu2e data area, not
versioned in MLTrain -- same convention TrkQual follows).
* Large run outputs and processed graphs (regenerable from
EventNtuple ROOT files via scripts/build_all_graphs.sh).
The v2 training data requires the `calomcsim.ancestorSimIds` branch
added in Mu2e/EventNtuple (PR pending). The README will cross-link
there once the EventNtuple PR has a number.
Test suite: 88/88 passing in this layout via
`python3 -m unittest discover -s tests -p "test_*.py" -v` after
`source setup_env.sh`.
Both trained models in CaloClusterGNN/ now have a complete training-
to-ONNX path inside MLTrain (consistent with the TrkQual pattern of
shipping conversion scripts alongside training).
New / restored:
* src/models/calo_cluster_net_deploy.py tensor-API wrapper around
  CaloClusterNet (no PyG Data, no node-saliency head); used by ONNX
  export so torch.onnx.export can trace it. The wrapper-plus-export
  pattern is sketched after this list.
* src/models/simple_edge_net_deploy.py same shape for SimpleEdgeNet.
No node head to bypass, so it's a thin pass-through.
* scripts/export_onnx.py --model {ccn,sen} flag with
per-model presets (checkpoint, output path, model_version). Stamps
metadata_props {model_version, node_features, edge_features} into
the .onnx after export.
* scripts/export_norm_stats.py writes the train-split z-score
stats next to the .onnx as a flat JSON sidecar so the C++ side
doesn't need a LibTorch dep to read 28 floats.
* scripts/validate_onnx.py --model flag with per-model
preset for tau_edge and tolerance. Asserts:
- max abs-diff edge_logits within tol on the full val split
- zero per-edge threshold flips at tau_edge (proxy for cluster-
reco byte-equivalence with the deployed C++ pipeline)
* tests/test_calo_cluster_net_deploy.py (9 tests)
* tests/test_export_onnx.py (5 tests)
* tests/test_export_norm_stats.py (8 tests)
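The wrapper-plus-export pattern referenced above looks roughly like this. The class name, forward signature, tensor names, and dynamic-axes layout are all illustrative, not the repo's actual API:

```python
import torch

class DeployWrapper(torch.nn.Module):
    """Tensor-in/tensor-out facade so torch.onnx.export can trace the
    model without PyG Data objects (illustrative; the real wrappers are
    calo_cluster_net_deploy.py / simple_edge_net_deploy.py)."""

    def __init__(self, net):
        super().__init__()
        self.net = net

    def forward(self, node_feats, edge_index, edge_feats):
        # Return only the edge logits; auxiliary heads (e.g. node
        # saliency) are bypassed for deployment.
        return self.net(node_feats, edge_index, edge_feats)

# Hypothetical export call with dynamic hit/edge counts:
# wrapper = DeployWrapper(trained_net).eval()
# torch.onnx.export(
#     wrapper,
#     (node_feats, edge_index, edge_feats),      # example input tensors
#     "calo_cluster_net.onnx",
#     input_names=["node_features", "edge_index", "edge_features"],
#     output_names=["edge_logits"],
#     dynamic_axes={"node_features": {0: "n_hits"},
#                   "edge_index": {1: "n_edges"},
#                   "edge_features": {0: "n_edges"},
#                   "edge_logits": {0: "n_edges"}},
# )
```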
README extended with an "Exporting a Trained Model to ONNX" section
that documents the full chain for both models, the
metadata_props deployment contract, and the per-model frozen
tau_edge/tol values used by validate_onnx.py.
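To make the metadata_props contract concrete, here is a sketch using the onnx Python API. The filenames and the "v2" value are assumptions; the three keys are the ones listed above:

```python
import json
import onnx

# Producer side (what export_onnx.py does conceptually): stamp the
# contract keys into the exported model.
model = onnx.load("calo_cluster_net.onnx")
onnx.helper.set_model_props(model, {
    "model_version": "v2",     # illustrative value
    "node_features": "6",
    "edge_features": "8",
})
onnx.save(model, "calo_cluster_net.onnx")

# Consumer side: read the props back and check them against config.
props = {p.key: p.value
         for p in onnx.load("calo_cluster_net.onnx").metadata_props}
assert props["node_features"] == "6" and props["edge_features"] == "8"

# The z-score stats ship as a flat JSON sidecar next to the .onnx,
# readable without a LibTorch dependency (filename assumed):
stats = json.load(open("calo_cluster_net.norm_stats.json"))
```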
Test count goes from 88 to 110 (4 conditionally skipped on a fresh
checkout when no trained checkpoint is present locally; this is by
design and the skip messages name the missing file).
Also acknowledges Claude assistance in README.
Summary
Adds a new `CaloClusterGNN/` subdirectory containing the full training
pipeline for a Graph Neural Network calorimeter-clustering algorithm,
intended to run alongside the existing seed+BFS `CaloClusterMaker` in
Mu2e Offline. The deployed recipe is CCN+BFS10: `CaloClusterNet` edge
classifier followed by BFS-style traversal at ExpandCut = 10 MeV.
Layout follows the convention set by `TrkQual/` and `TrkPID/`: one
top-level subdirectory per algorithm, self-contained.
What's in here
Two model classes train from the same pipeline:
`SimpleEdgeNet` and `CaloClusterNet`, each deployed with its own frozen
`tau_edge`. Both share the input graph (one per calorimeter disk per
event, 6 node features + 8 edge features) and z-score normalisation, so
swapping models in deployment is config-only.
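The shared normalisation is plain per-feature z-scoring: with 6 node + 8 edge features, the sidecar's 28 floats are one mean and one standard deviation per feature. A sketch with assumed JSON key names:

```python
import json
import numpy as np

def load_norm_stats(path):
    # Flat JSON sidecar: (6 + 8) means + (6 + 8) stds = 28 floats.
    # The key names here are assumptions, not the repo's schema.
    s = json.load(open(path))
    return (np.array(s["node_mean"]), np.array(s["node_std"]),
            np.array(s["edge_mean"]), np.array(s["edge_std"]))

def zscore(x, mean, std):
    """Train-split z-score: (x - mu) / sigma, applied feature-wise."""
    return (x - mean) / std
```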
Headline result
On the MDC2025 mixed-pileup test set (276,688 events, 481,543
disk-graphs), CCN+BFS10 beats BFS on every downstream-relevant
cluster-physics metric for E_reco >= 50 MeV clusters (those that
matter for track finding). In the 95-110 MeV signal region (47,279
clusters), mean abs(dE) drops from 0.368 (BFS) to 0.210 (-43%) and
mean dr drops from 0.559 mm to 0.460 mm (-18%).
Reproducibility
After `source setup_env.sh`, the full build/pack/train/tune/evaluate
pipeline runs from `scripts/`. Frozen hyperparameters and the exact
recipe values live in `configs/` and are documented in the README.
Coordinated PRs
* Mu2e/EventNtuple (PR pending): adds `calomcsim.ancestorSimIds` to
  `SimInfo`. The v2 training data uses calo-entrant ancestor truth,
  which requires this branch. Link this PR into the EventNtuple PR
  once it has a number.
* Mu2e/Offline integration PR: adds two modules (`CaloHitGraphMaker`,
  `CaloClusterMakerGNN`) under `Offline/CaloCluster/`. They load the
  `.onnx` exported by this repo via `art::ConfigFileLookupPolicy`,
  assert `metadata_props` agreement (`model_version`, `node_features`,
  `edge_features`) against FHiCL, and emit a `CaloClusterCollection`
  under instance name `"GNN"` so existing BFS-reading analyses keep
  working.

C++↔Python parity has been validated byte-exactly on the val split
(100/100 disk-graphs, 8,502 hits) using a parity-dump analyzer plus a
Python comparison harness.
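The Python side of such a parity check is small. A sketch of the two assertions validate_onnx.py makes, assuming tau_edge is a probability threshold applied after a sigmoid (names and calling convention are illustrative):

```python
import numpy as np

def check_parity(torch_logits, onnx_logits, tau_edge, tol):
    """Assert the exported model matches the PyTorch reference:
    logits agree within tol, and no edge flips across the decision
    threshold (proxy for identical cluster reconstruction)."""
    torch_logits = np.asarray(torch_logits)
    onnx_logits = np.asarray(onnx_logits)

    max_diff = np.max(np.abs(torch_logits - onnx_logits))
    assert max_diff <= tol, f"max abs-diff {max_diff:.3g} exceeds tol {tol:.3g}"

    def decide(logits):                     # probability-space decision
        return 1.0 / (1.0 + np.exp(-logits)) >= tau_edge

    flips = np.sum(decide(torch_logits) != decide(onnx_logits))
    assert flips == 0, f"{flips} per-edge threshold flips at tau_edge"
```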
Tests
The 4 skipped tests are conditional: they exercise loading a real
trained checkpoint or the exported `.onnx`, and self-skip with a clear
message when those files aren't in the local checkout (the case for a
fresh clone).
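The self-skip is standard unittest fare; a sketch with an illustrative checkpoint path:

```python
import os
import unittest

CKPT = "checkpoints/calo_cluster_net_v2.pt"  # illustrative path

class TestCheckpointLoading(unittest.TestCase):
    @unittest.skipUnless(os.path.exists(CKPT),
                         f"missing trained checkpoint: {CKPT}")
    def test_load_checkpoint(self):
        ...  # would load the real checkpoint here
```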
Acknowledgement
Implementation, refactoring, and documentation drafting in this
subdirectory were assisted by Anthropic's Claude (Claude Code). All
scientific decisions, hyperparameter choices, validation results, and
the v1→v2 truth-definition campaign are my own work.
Notes:
* The EventNtuple PR cross-link is left as a placeholder until that PR
  has a number (easy edit on the GitHub PR page).
* This description is markdown; paste it into the web UI and it'll
  render the tables and code blocks correctly.