102 changes: 102 additions & 0 deletions CrvCosmic/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,102 @@
# CrvCosmic

XGBoost-based classifier for rejecting cosmic-ray-induced backgrounds at the Mu2e experiment. The model takes per-coincidence CRV and tracker features as input and outputs a probability that an event is cosmic-induced. It is trained on `CosmicCRYSignalAllOnSpillTriggered` (pure cosmics) versus `CeEndpointMix2BBTriggered` (beam+pileup) datasets.

This code was ported from [`sam-grant/mu2e-cosmic`](https://github.com/sam-grant/mu2e-cosmic).

## Layout

```
CrvCosmic/
├── config/
│   └── cuts.yaml                     # Cutset definition
├── src/
│   ├── core/
│   │   ├── analyse.py                # Cut flow + feature helpers
│   │   └── postprocess.py            # combine_cut_flows / combine_hists / combine_arrays
│   ├── utils/
│   │   ├── io_manager.py             # pkl + cuts.yaml load, pkl/csv/h5/parquet write
│   │   ├── hist_manager.py           # histogram booking/filling for the summary plots
│   │   ├── draw.py                   # plot_summary_ml
│   │   └── mu2e.mplstyle
│   └── ml/
│       ├── process.py                # MLProcessor: extracts features from ROOT files
│       ├── load.py                   # LoadML: read processed data and trained models
│       ├── assemble.py               # AssembleDataset: label, shuffle, fold-split
│       ├── train.py                  # Train: fit XGBoost, K-fold CV, save UBJ
│       ├── validate.py               # Validate: ROC, threshold scan, money table
│       └── optimise.py               # Optimise: grid search over hyperparameters
├── run/
│   └── run_ml_prep.py                # entrypoint: process ROOT files into parquet/pkl
└── notebooks/
    ├── assemble.ipynb                # Validate assemble module
    ├── feature_engineering.ipynb     # Feature evaluation
    ├── optimise.ipynb                # Hyperparameter tuning
    ├── train.ipynb                   # Training
    └── validate.ipynb                # Validation
```

The non-ML pieces of the upstream cosmic-background framework are not included here — only the cuts and helpers that the ML preselection actually exercises.

## Pipeline

1. **Process**: read ROOT EventNtuple files, apply the `MLPreprocess` cutset,
flatten per-coincidence features into awkward/parquet output.
2. **Assemble**: load processed CRY and CE-mix outputs, label, combine, and
produce K-fold indices for nested cross-validation.
3. **Train**: fit XGBoost on each fold, find the per-fold operating threshold at
a target veto efficiency, then retrain on the full set with the
CV-averaged threshold and export to `.ubj`.
4. **Validate**: compute ROC AUC, score distributions, feature importance, and
a "money table" comparing the ML model against the simple `dT` cut baseline.
5. **Optimise** (optional): grid search over XGBoost hyperparameters using
K-fold CV; minimises deadtime at the target veto efficiency.
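
Steps 3 and 5 both depend on an operating threshold chosen at a target veto efficiency. The following is a minimal sketch of that idea in plain Python, not the contents of `train.py`. The function names, the toy scores, and the reading of deadtime as the fraction of CE-mix events scored at or above the threshold are all assumptions made for illustration.

```python
import math


def threshold_at_efficiency(cosmic_scores, target_eff):
    """Largest threshold t such that at least target_eff of cosmics have score >= t."""
    if not cosmic_scores:
        raise ValueError("need at least one cosmic score")
    ranked = sorted(cosmic_scores, reverse=True)
    # Number of cosmics that must be vetoed to reach the target efficiency.
    # The small offset guards against floating-point overshoot in the product.
    n_needed = max(1, math.ceil(target_eff * len(ranked) - 1e-12))
    return ranked[n_needed - 1]


def veto_fraction(scores, threshold):
    """Fraction of events whose score is at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Toy per-fold scores (hypothetical numbers, not real model output).
folds = [
    {"cosmic": [0.99, 0.97, 0.95, 0.90, 0.40], "ce_mix": [0.05, 0.10, 0.30, 0.92]},
    {"cosmic": [0.98, 0.96, 0.93, 0.85, 0.60], "ce_mix": [0.02, 0.20, 0.50, 0.70]},
]
target_eff = 0.8  # assumed target, chosen only for the example

# One threshold per fold, then the CV average used for the final model.
per_fold = [threshold_at_efficiency(f["cosmic"], target_eff) for f in folds]
cv_threshold = sum(per_fold) / len(per_fold)

for f, t in zip(folds, per_fold):
    eff = veto_fraction(f["cosmic"], t)
    deadtime = veto_fraction(f["ce_mix"], t)  # assumed definition of deadtime
    print(f"threshold={t:.2f} veto_eff={eff:.2f} deadtime={deadtime:.2f}")
print(f"CV-averaged threshold: {cv_threshold:.3f}")
```

Under this reading, the grid search in step 5 would compare hyperparameter settings by the deadtime each reaches at the same target efficiency.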

## Setup

On a Mu2e VM or EAF:

```bash
mu2einit
pyenv ana
```

The code depends on [`Mu2e/pyutils`](https://github.com/Mu2e/pyutils) (provided by `pyenv ana`), from which `pyutils.pycut.CutManager` is used directly, plus `xgboost`, `scikit-learn`, `awkward`, `pandas`, `hist`, `pyarrow`, `h5py`, `joblib`, and `pyyaml`.

## Running the pipeline

### 1. Preprocess ROOT files into ML-ready features

```bash
cd run
python run_ml_prep.py # production: full datasets via xrootd
python run_ml_prep.py --test # test: two local files, if they exist
```

This populates `output/ml/{run}/data/{tag}/` with `events.parquet`, `results.pkl`, `hists.h5`, `stats.csv`, and `cut_flow.csv`. The default production `run_str` is set inside `run_ml_prep.py` (currently `"k"`); change it when starting a new training round.

### 2. Train and validate

```bash
cd notebooks
jupyter notebook
```


The final model is saved to `output/ml/{run}/results/final/model_{version}.ubj`, suitable for inference with the standard XGBoost C API.

## Configuring cuts

Cutsets are defined in `config/cuts.yaml`. The ML preprocessing uses the `MLPreprocess` cutset by default (set in `src/ml/process.py`). To modify the preselection, edit the thresholds or active flags under `cutsets.MLPreprocess`. To define a new cutset, add an entry under `cutsets:` and override `cutset_name` when constructing `MLProcessor`.
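
A cutset only lists the values it changes relative to `defaults`. The sketch below shows one way such an override could resolve, using an abridged excerpt of `config/cuts.yaml` written as a Python dict. `resolve_cutset` is a hypothetical helper introduced for illustration; how `CutManager` performs the merge is not shown here and may differ.

```python
import copy

# Abridged excerpt of config/cuts.yaml as a dict, so the sketch has no dependencies.
config = {
    "defaults": {
        "thresholds": {"lo_nactive": 20, "lo_trkqual": 0.8},
        "active": {"good_trkqual": True, "unvetoed": True, "within_sig_win": True},
    },
    "cutsets": {
        "MLPreprocess": {
            "thresholds": {"lo_trkqual": 0.2},
            "active": {"unvetoed": False, "within_sig_win": False},
        },
    },
}


def resolve_cutset(config, cutset_name):
    """Return a copy of the defaults with the named cutset's overrides applied."""
    resolved = copy.deepcopy(config["defaults"])  # leave the defaults untouched
    overrides = config["cutsets"][cutset_name]
    for section in ("thresholds", "active"):
        resolved[section].update(overrides.get(section, {}))
    return resolved


ml_cuts = resolve_cutset(config, "MLPreprocess")
print(ml_cuts["thresholds"])
print(ml_cuts["active"])
```

Values the cutset does not mention, such as `lo_nactive`, keep their default, which is why a new cutset entry can stay short.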

## Output paths

All outputs land under `CrvCosmic/output/`:

- `output/ml/{run}/data/{tag}/` — processed per-coincidence features
- `output/ml/{run}/results/{tag}/` — trained models, threshold scans,
money tables, CSV summaries
- `output/images/ml/{run}/` — ROC curves, score distributions, feature
importance plots

These directories are created on demand. Symlink them to another disk if space is a concern.
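
The layout above can be sketched with `pathlib`. `output_dirs` and the tag value `example_tag` are hypothetical names used only for illustration, and the repository's own I/O code may build these paths differently. The run string `"k"` is the current default mentioned earlier.

```python
from pathlib import Path
import tempfile


def output_dirs(base, run, tag):
    """Build the three output locations for a given run and tag (hypothetical helper)."""
    base = Path(base)
    return {
        "data": base / "output" / "ml" / run / "data" / tag,
        "results": base / "output" / "ml" / run / "results" / tag,
        "images": base / "output" / "images" / "ml" / run,
    }


# Work in a temporary directory so the sketch does not touch a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    dirs = output_dirs(Path(tmp) / "CrvCosmic", run="k", tag="example_tag")
    for path in dirs.values():
        # parents=True creates missing intermediate directories on demand;
        # exist_ok=True makes repeated runs harmless.
        path.mkdir(parents=True, exist_ok=True)
    created = all(path.is_dir() for path in dirs.values())
    print(created)
```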
72 changes: 72 additions & 0 deletions CrvCosmic/config/cuts.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
# Samuel Grant 2025
# Define cutsets for cosmic-ray-induced background analysis

##########################################################
# Define baseline defaults (from sam-grant/mu2e-cosmic)
##########################################################
defaults:
  description: "Baseline defaults"
  thresholds:
    track_pdg: 11
    lo_nactive: 20
    lo_trkqual: 0.8
    lo_t0_ns: 640
    hi_t0_ns: 1650
    lo_maxr_mm: 450
    hi_maxr_mm: 680
    hi_d0_mm: 100
    hi_t0err: 0.9
    lo_pitch_angle: 0.5
    hi_pitch_angle: 1.0
    lo_crv_start_ns: 420.0
    hi_crv_end_ns: 1700.0
    lo_veto_dt_ns: -150.0
    hi_veto_dt_ns: 150.0
    lo_wide_win_mevc: 85
    hi_wide_win_mevc: 200
    lo_ext_win_mevc: 100
    hi_ext_win_mevc: 110
    lo_sig_win_mevc: 103.6
    hi_sig_win_mevc: 104.9
  active:
    one_reco_electron: true
    is_reco_electron: true
    thru_trk: true
    good_trkqual: true
    within_t0: true
    is_downstream: true
    has_hits: true
    within_t0err: true
    within_d0: true
    within_pitch_angle_lo: true
    within_pitch_angle_hi: true
    within_lhr_max_lo: false
    within_lhr_max_hi: true
    is_truth_electron: true
    within_coinc_start_time: false
    within_coinc_end_time: false
    unvetoed: true
    within_wide_win: true
    within_ext_win: true
    within_sig_win: true

##########################################################
# Define custom cutsets, which override the defaults
# You can turn cuts off and edit thresholds here
##########################################################
cutsets:
  ########################
  # Analysis cut sets
  ########################
  MLPreprocess:
    description:
      "Preselection cuts for ML"
    thresholds:
      lo_trkqual: 0.2
    active:
      within_coinc_start_time: false
      within_coinc_end_time: false
      unvetoed: false
      within_wide_win: true # no real cosmics below 85 MeV/c
      within_ext_win: false
      within_sig_win: false