168 commits
191ba71
fix: enable test suite, fix missing dependencies, update compiler to …
ypriverol Apr 13, 2026
d7a8b61
fix(#157): preserve PSM scores when DeNovoScore is below threshold
ypriverol Apr 13, 2026
4ba9e0e
feat(#159): add -msLevel parameter for MS level filtering
ypriverol Apr 13, 2026
65c2592
Merge pull request #5 from bigbio/feature/159-ms-level-filtering
ypriverol Apr 13, 2026
5f0ced1
Merge pull request #4 from bigbio/fix/157-mzid-missing-psm-scores
ypriverol Apr 13, 2026
4f74816
Merge pull request #3 from bigbio/fix/test-infrastructure
ypriverol Apr 13, 2026
c400668
refactor: remove dead code — 150 unused classes across 10 packages
ypriverol Apr 13, 2026
79e1779
docs: modernize README with full parameter reference
ypriverol Apr 13, 2026
3909fe8
Merge pull request #8 from bigbio/feature/ci-readme-cleanup
ypriverol Apr 14, 2026
e955869
Merge pull request #7 from bigbio/refactor/dead-code-removal
ypriverol Apr 14, 2026
ae86a1a
feat: add direct TSV output (-outputFormat tsv|mzid|both)
ypriverol Apr 14, 2026
a0a36bb
feat: include Percolator features in direct TSV output
ypriverol Apr 14, 2026
09d62e2
perf: replace jmzml JAXB parser with StAX-based mzML reader
ypriverol Apr 13, 2026
31cb92c
refactor: remove jmzml dependency, add referenceableParamGroupRef sup…
ypriverol Apr 13, 2026
dbe981d
chore: remove unused jmzReader dependency
ypriverol Apr 13, 2026
17b202d
refactor: remove mzXML support and jrap/stax library
ypriverol Apr 13, 2026
55daec4
Update src/main/java/edu/ucsd/msjava/mzml/StaxMzMLParser.java
ypriverol Apr 14, 2026
f2a9075
fix: address Copilot review, clean up pom.xml
ypriverol Apr 14, 2026
5332df4
feat: add direct TSV output (-outputFormat tsv|mzid|both)
ypriverol Apr 14, 2026
65088eb
feat: include Percolator features in direct TSV output
ypriverol Apr 14, 2026
3eb2966
fix: update TestParsers after dead-code removal rebase
ypriverol Apr 14, 2026
1f00a2b
Merge pull request #9 from bigbio/feature/native-tsv-output
ypriverol Apr 15, 2026
d4c1f9c
Merge pull request #6 from bigbio/feature/stax-mzml-reader
ypriverol Apr 15, 2026
87e7e75
chore: split infra/packaging updates into reviewable PR (#11)
ypriverol Apr 16, 2026
d7ebe6a
feat: add dataset-scoped PXD001819 benchmark CI scaffold
ypriverol Apr 16, 2026
b3f2e98
chore: align benchmark naming and mzXML messaging
ypriverol Apr 16, 2026
ea0de94
docs: drop PXD001819 plan file; point READMEs at CI docs
ypriverol Apr 16, 2026
cf6275d
fix(benchmark): address Copilot review on PXD001819 CI scaffold
timosachsenberg Apr 16, 2026
ed14765
fix(benchmark): harden PXD001819 scaffold per review feedback
claude Apr 16, 2026
3c47109
Merge pull request #13 from bigbio/claude/review-msgfplus-pr-12-YfoTI
ypriverol Apr 16, 2026
032d088
Merge pull request #12 from bigbio/benchmark
ypriverol Apr 16, 2026
b1a1498
feat(msgf): port primitive CSR graph + flat-array GF from feat/primit…
ypriverol Apr 16, 2026
9d07047
perf(msgf): stream mass-index GF merging to drop peak memory
ypriverol Apr 17, 2026
757506c
docs: add troubleshooting guide and isobaric-labeling recipes
ypriverol Apr 17, 2026
8442f2c
perf(scorer): drop java.util.Hashtable for HashMap/ConcurrentHashMap
ypriverol Apr 17, 2026
e597230
Merge pull request #15 from bigbio/feat/primitives-optimization
ypriverol Apr 17, 2026
cafbf73
perf(scorer): cache per-scorer log tables to avoid runtime Math.log
ypriverol Apr 17, 2026
9cdae16
Merge pull request #16 from bigbio/perf/precompute-log-scores
ypriverol Apr 17, 2026
b433305
chore(reliability): actionable centroiding error + missedCleavages test
ypriverol Apr 17, 2026
7e3a69d
chore(reliability): broaden centroiding hint to cover ThermoRawFilePa…
ypriverol Apr 17, 2026
928a4f4
feat(msgf): add -minSpectraPerThread flag to override thread-cap divisor
ypriverol Apr 17, 2026
97431ce
feat(misc): add MSGFLogger, wire verbose flag into MSGFPlus entry point
ypriverol Apr 17, 2026
095100e
feat(misc): add RunManifestWriter sidecar for run reproducibility
ypriverol Apr 17, 2026
db73197
chore(mzml): annotate StaxMzMLParser BOM/prolog errors with actionabl…
ypriverol Apr 17, 2026
779eec2
feat(mzid): DirectPinWriter + -outputFormat 3 (pin) (Q7)
ypriverol Apr 17, 2026
fb76029
Merge pull request #17 from bigbio/chore/reliability-quick-wins
ypriverol Apr 17, 2026
1bd9ff2
Merge pull request #18 from bigbio/feat/logger-and-run-manifest
ypriverol Apr 17, 2026
03e6e5a
fix(mzml): explicit initCause on annotated XMLStreamException
ypriverol Apr 18, 2026
911b070
Merge pull request #19 from bigbio/feat/stax-error-context
ypriverol Apr 18, 2026
1d481aa
Merge pull request #20 from bigbio/feat/direct-pin-writer
ypriverol Apr 18, 2026
a098715
feat(pin): add lnDeltaSpecEValue and matchedIonRatio Percolator features
ypriverol Apr 18, 2026
b86d65d
feat(pin): OpenMS PercolatorAdapter parity — enzN/enzC/enzInt/mass + …
ypriverol Apr 18, 2026
b020f0e
fix(pin): sanitize NaN/Infinity feature values before emitting to Per…
ypriverol Apr 19, 2026
f3cb45e
feat(mass-cal): SearchParams + DBSearchIOFiles scaffolding for precur…
ypriverol Apr 18, 2026
0bc40cc
feat(mass-cal): MassCalibrator class with DBScanner-based residual co…
ypriverol Apr 18, 2026
98d0e8f
feat(mass-cal): wire MassCalibrator into MSGFPlus.runMSGFPlus + Score…
ypriverol Apr 18, 2026
6279905
fix(mass-cal): size-guard in learnPrecursorShiftPpm to preserve off-m…
ypriverol Apr 18, 2026
e7f7f1c
test(mass-cal): integration test for -precursorCal off bit-identity gate
ypriverol Apr 18, 2026
0a29486
fix(mass-cal): raise size-guard threshold so test.mgf doesn't trip th…
ypriverol Apr 19, 2026
78af285
Merge pull request #22 from bigbio/feat/msgfplus-perf-ab
ypriverol Apr 19, 2026
4f9e8a7
feat(fragindex): EliasFano stub with empty-list codec
ypriverol Apr 18, 2026
658b102
feat(fragindex): naive int[] encoding behind EliasFano API
ypriverol Apr 18, 2026
d046ff8
feat(fragindex): EliasFano.open() returns a Cursor for zero-copy decode
ypriverol Apr 18, 2026
3cd5f50
feat(fragindex): Fingerprint128 bit-set + popcount primitive
ypriverol Apr 18, 2026
878bd00
feat(fragindex): SlabBuilder + immutable Slab view, bucket round-trip
ypriverol Apr 18, 2026
1f4cf21
feat(fragindex): FragmentIndexStore interface + in-memory DirectStore
ypriverol Apr 18, 2026
ef6ddbf
feat(fragindex): TheoreticalFragmentGenerator for b/y singly-charged …
ypriverol Apr 18, 2026
5e66119
feat(fragindex): SlabAssigner with boundary-overlap replication
ypriverol Apr 18, 2026
6eaaf9a
feat(fragindex): PeptideTable for per-slab peptide metadata
ypriverol Apr 18, 2026
6d9ca89
feat(fragindex): FragmentIndexBuilder + FragmentIndex holder
ypriverol Apr 18, 2026
f4d01fc
feat(fragindex): SuffixArrayPeptideWalker for tryptic enumeration ove…
ypriverol Apr 18, 2026
7a42a5f
feat(fragindex): FragmentIndexBuilder.buildFromSuffixArray overload
ypriverol Apr 18, 2026
8c1fa9e
feat(fragindex): BuildSA -buildFragIndex flag + tiny-fasta integratio…
ypriverol Apr 19, 2026
b101499
feat(fragindex): -useFragmentIndex off|on|compare CLI scaffold
ypriverol Apr 19, 2026
7120091
feat(fragindex): load FragmentIndex per file in MSGFPlus.runMSGFPlus
ypriverol Apr 19, 2026
7923576
feat(fragindex): FragmentIndexCandidateGenerator — NewRankSum Tier-1 …
ypriverol Apr 19, 2026
f408a49
feat(fragindex): DBScanner dispatches to FragmentIndexCandidateGenera…
ypriverol Apr 19, 2026
a88eb3e
fix(fragindex): drop NewRankScorer partition lookup — use peak-rank-w…
ypriverol Apr 19, 2026
0f6b26e
fix(fragindex): mass-tolerance filter in generator + pin-writer toler…
ypriverol Apr 19, 2026
649a026
feat(output)!: remove mzIdentML export; pin becomes default output fo…
ypriverol Apr 19, 2026
9bf01c8
feat(cleanup)!: delete mzid reading/writing entirely — no backward-co…
ypriverol Apr 19, 2026
4c838b4
feat(pin): restore Unimod/UnimodComposition for future PTM-aware pin …
ypriverol Apr 19, 2026
185b45b
feat(pin): add longest_b/longest_y/longest_y_pct ion-series run-lengt…
ypriverol Apr 20, 2026
e228314
revert: remove abandoned fragment-index experimental code
ypriverol Apr 20, 2026
880a9d5
perf: cache Partition.hashCode + tighter CandidatePeptideGrid allocation
ypriverol Apr 20, 2026
2851cbb
docs: changelog entry for fragment-index removal + gitignore cleanups
ypriverol Apr 20, 2026
862b3a8
docs(benchmarks): add 3-engine comparison figures
ypriverol Apr 20, 2026
f2a5774
chore: cleanup — add TestPartition, archive plans, gitignore session …
ypriverol Apr 20, 2026
2402802
fix(parser): support PRIDE-style scan extraction in MGF titles
ypriverol Apr 20, 2026
9b35daa
fix(test): TestPrecursorCalIntegration reads .pin TSV instead of mzid…
ypriverol Apr 21, 2026
29a8b12
fix: address Copilot review comments on PR #23
ypriverol Apr 21, 2026
1eb5340
docs: lowercase all filenames + add output.md describing the .pin/.ts…
ypriverol Apr 21, 2026
708c8b3
docs(examples): add PXD001819 sample .pin file + link from output.md
ypriverol Apr 21, 2026
d5aea81
docs(examples): describe pxd001819_example.pin in the inventory
ypriverol Apr 21, 2026
52b99d8
Merge pull request #23 from bigbio/feat/msgfplus-speed-v2
ypriverol Apr 21, 2026
bc47193
refactor(mzid): finish removing jmzidml library dependency
ypriverol Apr 24, 2026
37d8ad1
perf(buildsa): drop Suffix[] boxing from bucket sort
ypriverol Apr 24, 2026
a778f49
perf(buildsa): parallel per-thread bucket sort + merge
ypriverol Apr 24, 2026
5d88ea5
fix(buildsa): use readFully when loading .cseq sequence bytes
ypriverol Apr 25, 2026
0d29a37
perf(buildsa): stream parallel sort output to per-worker temp files
ypriverol Apr 25, 2026
a969bef
fix(mzml): bound parser cache + MS-level preload filter + defensive c…
ypriverol Apr 25, 2026
cee6589
refactor(msgfplus): defer per-task ScoredSpectraMap construction to w…
ypriverol Apr 25, 2026
0e6539a
refactor(mzml): drop misleading bounded-cache cap; keep MS-filter + d…
ypriverol Apr 25, 2026
237baf5
docs(mzml): remove stale LRU/maxCacheSize references in StaxMzMLParse…
ypriverol Apr 25, 2026
44e2681
refactor: trim verbose javadocs and dead code in PR scope (-176 LOC)
ypriverol Apr 25, 2026
aa26013
Merge pull request #24 from bigbio/feature/improve-mzid-suffix-big-fasta
ypriverol Apr 25, 2026
a0fd630
perf(search): T1 — per-task wall stats + tail-imbalance summary
ypriverol Apr 25, 2026
957f6c2
perf(search): drop dead synchronized wrappers in DBScanner + ScoredSp…
ypriverol Apr 25, 2026
bfea7be
perf(search): per-task result buffers; drop shared synchronizedList
ypriverol Apr 25, 2026
c3afe11
perf(search): T2 — make numTasks-per-thread multiplier configurable
ypriverol Apr 25, 2026
47cf7cf
perf(search): T3 — opt-in ForkJoinPool path via -Dmsgfplus.useForkJoi…
ypriverol Apr 25, 2026
a3f48fc
perf(search): tighter result-buffer merge + drainResultsTo + reused n…
ypriverol Apr 26, 2026
1b7b5dd
perf(msgfdb): drop redundant synchronizedList on per-task SpecKey sub…
ypriverol Apr 26, 2026
2673d08
refactor(search): simplify per /simplify review (-43 LOC, no behavior…
ypriverol Apr 26, 2026
7f4b099
refactor(search): drop redundant -Dmsgfplus.numTasksPerThread sysprop
ypriverol Apr 26, 2026
9b742d7
refactor: regroup CLI/output/parser packages
ypriverol Apr 26, 2026
6d998f3
docs(plans): add search-sync-cleanup + parameter-modernization plans
ypriverol Apr 26, 2026
4bb388c
build(deps): add picocli 4.7.6 + flag inventory for params modernization
ypriverol Apr 26, 2026
1581602
refactor(cli): declare typed MSGFPlusOptions (picocli @Command)
ypriverol Apr 26, 2026
310cb33
refactor(cli): route MSGFPlus argv through picocli + adapter
ypriverol Apr 26, 2026
1fe3709
refactor(cli): unify -conf path through picocli (Phase 2)
ypriverol Apr 26, 2026
5a2ec4e
refactor: drop deprecated MSGFDB entry point + dead MSGF/MSGFLib params
ypriverol Apr 26, 2026
de71b58
refactor(cli): typed converters for tolerance + int-range CLI flags
ypriverol Apr 26, 2026
03f32c1
refactor(cli): retire ParamManager from the hot path (Phase 4c)
ypriverol Apr 26, 2026
f5f3c47
refactor: delete edu.ucsd.msjava.params hierarchy (Phase 3)
ypriverol Apr 26, 2026
1c68fb2
refactor: drop MS2/PKL/DTA_TXT spectrum format support
ypriverol Apr 27, 2026
dfa5dd9
refactor: rename parser/ package to mgf/
ypriverol Apr 27, 2026
85d0afe
fix(cli): CustomAA= config-file crash + 3 picocli polish issues
ypriverol Apr 27, 2026
8fc6e2b
fix(cli): restore -m 4 = UVPD activation method
ypriverol Apr 27, 2026
05e664a
docs: refresh README + module docs after PR #25 cleanup
ypriverol Apr 27, 2026
7a19f83
fix(cli): three Phase 4c regressions + polish on MSGFPlusOptions
ypriverol Apr 27, 2026
8330bc3
refactor(cli): typed enums for -outputFormat and -precursorCal
ypriverol Apr 27, 2026
b7dce4c
docs(changelog): document parameter-modernization sweep in vNEXT
ypriverol Apr 27, 2026
657cc5e
refactor: drop ~2,074 LOC of dead/redundant code (audit pass)
ypriverol Apr 27, 2026
4e2ad50
refactor: trim deps + dead methods across fdr/msgf/msscorer/msutil/se…
ypriverol Apr 27, 2026
f89d6ed
test: consolidate fixture builders into SearchTestFixtures
ypriverol Apr 27, 2026
6d7f8b7
chore: drop trivial comments that restate signatures
ypriverol Apr 27, 2026
fff7b82
chore: remove commented-out code blocks repo-wide
ypriverol Apr 27, 2026
a3994de
fix(mgf): strip UTF-8 BOM in BufferedLineReader + drop dead MSGFResult
ypriverol Apr 27, 2026
38b02ed
refactor: first-wave record migration (8 types)
ypriverol Apr 27, 2026
2216bbb
Merge pull request #25 from bigbio/perf/search-sync-cleanup
ypriverol Apr 27, 2026
878b0cb
docs(plans): start astral-speed-improvements; fold shipped plans into…
ypriverol Apr 28, 2026
eee9fa6
docs(plans): consolidate to 5x roadmap; adopt milestone-commit shippi…
ypriverol Apr 28, 2026
960c664
docs(plans): Phase A retrospective; revert in-tree code via separate …
ypriverol Apr 28, 2026
684abef
docs(plans): Phase E retrospective — parallelism win not replicable u…
ypriverol Apr 28, 2026
86ff529
docs(plans): Phase E GC-pressure follow-up — bigger heap helps 8t but…
ypriverol Apr 29, 2026
019facd
docs(plans): Phase E final disproof — anti-scaling and ForkJoin win w…
ypriverol Apr 29, 2026
7a684f2
fix(bench-ci): unbreak PXD001819 CI after PR #23 mzIdentML removal
ypriverol Apr 29, 2026
781738e
feat(phase-b-telemetry): add opt-in counter for pairing fan-out verif…
ypriverol Apr 29, 2026
05ec066
fix(calibrator): isolate pre-pass at iso=[0,0] + outlier-filter resid…
ypriverol Apr 29, 2026
7c027f8
feat(phase-b): expose tightening formula constants as system properties
ypriverol Apr 29, 2026
aac389c
feat(calibrator): stratify residuals by spec_eValue, keep top MIN_CON…
ypriverol Apr 29, 2026
f1a6e62
docs(plans): record Phase B Astral win after stratification fix
ypriverol Apr 29, 2026
8070e79
docs(plans): SHIPPED.md Active section reflects Phase B win
ypriverol Apr 29, 2026
d85399b
docs(plans): three-dataset Phase B validation table in SHIPPED.md
ypriverol Apr 29, 2026
957a6e9
docs(plans): Experiment 2 design — exact prefix mass-interval pruning
ypriverol Apr 29, 2026
4241fbb
feat(experiment-2): mass-interval pruning scaffold (off by default; C…
ypriverol Apr 29, 2026
f7310e9
docs(plans): Experiment 2 status header — kill gate hit on wall
ypriverol Apr 29, 2026
0c697dd
perf(experiment-2): replace TreeMap.subMap with binary-search on sort…
ypriverol Apr 29, 2026
a19b17f
docs(plans): Experiment 2 status header reflects Checkpoint 3 result
ypriverol Apr 29, 2026
8478651
perf(experiment-2): gate bound test on peptideLengthIndex >= minPepti…
ypriverol Apr 29, 2026
af65dd2
docs(plans): Experiment 2 Checkpoint 4 — gate-on-minPeptideLength shi…
ypriverol Apr 29, 2026
7a4a512
docs(plans): Experiment 2 Checkpoint 4 confirmation — 5-trial bench, …
ypriverol Apr 30, 2026
aa4aaae
chore: remove non-shippable runtime scaffolding; keep Phase B as the …
ypriverol Apr 30, 2026
5d9482d
fix(phase-b): isolate Spectrum state during calibration pre-pass
ypriverol May 1, 2026
6b8a177
chore: align .claude/plans/ and benchmark/ci/ with dev (drop from PR …
ypriverol May 1, 2026
0434bd1
feat(calibrator): expose maxSampled and minConfidentPsms as system pr…
ypriverol May 1, 2026
6512011
Merge pull request #26 from bigbio/feat/precursor-window-tightening
ypriverol May 1, 2026
60 changes: 60 additions & 0 deletions .claude/CLAUDE.md
@@ -0,0 +1,60 @@
# MS-GF+ Project — Claude Context

## Overview

MS-GF+ is a mass spectrometry database search tool for peptide identification.
The codebase is Java (Maven build). Benchmark harness scripts are local-only (not committed).

## Branch

Primary integration branch: `dev`

## Key Directories

- `src/main/java/edu/ucsd/msjava/` — core Java source
  - `msdbsearch/` — database search engine (DBScanner, ScoredSpectraMap)
  - `msutil/` — spectrum utilities (SpecKey, SpecKeyResult, SpectrumMetadata)
  - `mzid/` — `DirectPinWriter` + `DirectTSVWriter` (only writers retained; all mzIdentML classes + consumers deleted)
  - `mzml/` — mzML parser (StaxMzMLParser — streaming rewrite)
  - `parser/` — input file parsers (MgfSpectrumParser, etc.)
  - `ui/` — CLI entry points (MSGFPlus, MSGFDB)
- Local benchmark harness/scripts are intentionally out-of-tree and not committed as `benchmark/`
- `src/test/` — unit tests

## Build

```bash
mvn -B verify
```

**Do NOT run full `mvn test` without scoping.** The suite includes `TestPrecursorCalIntegration` which runs 4 full MS-GF+ searches on the 82 MB `human-uniprot-contaminants.fasta` fixture and takes ≥ 90 min on an idle machine. For iteration, scope to relevant classes:

```bash
mvn -B -o test -Dtest='TestDirectPinWriter,TestMassCalibrator,TestPrecursorCalScaffolding'
```

## Conventions

- Java 17+
- Maven for dependency management
- Percolator `.pin` as the default output format (mzIdentML output removed; feed downstream via Percolator)
- TSV export via DirectTSVWriter
- Percolator `.pin` export via DirectPinWriter (PR #20 + PR #22)

## Performance-sensitive invariants (learned empirically)

- **Never wrap hot-path collections in `Map.copyOf` / `ImmutableCollections`.** Observed 2.2× Astral regression — likely a bad interaction between `Partition.hashCode` clustering and ImmutableCollections' open-addressing.
- **Any optional scoring-path feature behind a flag must be bit-identical to baseline when disabled.** Implement via `if (mode == OFF) return input_unchanged;` at the top of the entry point — do NOT rely on "multiply by zero" or "flag-dependent branch deep in the loop"; both reorder float ops.
- **Pre-passes (calibrators, samplers) must not mutate shared state.** MS-GF+'s `Spectrum` objects are shared across the pre-pass and main pass; mutating them in the pre-pass (e.g. via `scorer.getScoredSpectrum(spec)`) causes silent PSM-count drift when the main pass re-reads the mutated state.
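A minimal sketch of the OFF-guard pattern from the second bullet, assuming a hypothetical `applyShift` entry point and `Mode` enum (neither name is taken from the MS-GF+ source). The ON path copies its input rather than mutating it, in the spirit of the third bullet.

```java
final class OffGuardDemo {
    enum Mode { OFF, ON }

    /**
     * Hypothetical optional correction applied to precursor masses.
     * With Mode.OFF the very same array is returned and no arithmetic runs,
     * so the disabled path cannot differ from baseline.
     */
    static double[] applyShift(double[] masses, double shiftPpm, Mode mode) {
        if (mode == Mode.OFF) {
            return masses; // input unchanged: no float ops, no copy
        }
        // ON path: write into a fresh array so shared input is never mutated.
        double[] out = new double[masses.length];
        for (int i = 0; i < masses.length; i++) {
            out[i] = masses[i] * (1.0 - shiftPpm * 1e-6);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] masses = {500.25, 1000.5};
        System.out.println(applyShift(masses, 3.0, Mode.OFF) == masses);
        System.out.println(applyShift(masses, 3.0, Mode.ON)[0]);
    }
}
```

Returning the input reference itself makes the OFF path easy to assert on in a test (`==` on the array), which is a stricter check than comparing element values.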

## Benchmark harness

Local-only, gitignored (`benchmark/*` with `!benchmark/README.md` / `!benchmark/ci/` carve-outs). Three 3-arm scripts per dataset:

- `benchmark/run_pxd001819_3arm.sh` / `run_astral_3arm.sh` / `run_tmt_3arm.sh` — each runs baseline JAR / branch off / branch auto and produces `.pin` files
- `benchmark/compare_*_3arm_percolator.sh` — runs Percolator via Docker (biocontainers 3.7.1) on each pin; prints 1% / 5% FDR target counts
- See `~/.claude/projects/-Users-yperez-work-msgfplus/memory/reference_benchmark_infra.md` for full details (conda env, Docker image, dataset locations)

## Next planned work

**Speed v2: fragment-index as candidate generator.** The current `feat/frag-index-phase1` branch (local, not pushed) has a working fragment-index OFF-path and a broken ON-path. The next session's mission is a clean rewrite per `~/.claude/plans/msgfplus-fragment-index/speed-rewrite-v2.md`. Target: ≥10× Astral speedup while preserving recall and reducing memory.
91 changes: 91 additions & 0 deletions .claude/investigations/001-mgf-scan-number-extraction-failure.md
@@ -0,0 +1,91 @@
# Investigation 001: MGF Scan Number Extraction Failure

**Status:** OPEN
**Date observed:** 2026-04-15
**Severity:** Medium — functional (spectra still searched, but scan numbers missing in output)
**Branch:** `feature/streaming-mzml-parser`

## What Was Observed

When running the baseline benchmark against MGF files, MS-GF+ emits repeated warnings:

```
Unable to extract the scan number from the title: id=PXD002047;TCGA-AA-A02O-01A-23_W_VU_20130205_A0218_10A_R_FR05.mzML;controllerType=0
Expected format is DatasetName.ScanStart.ScanEnd.Charge
```

The warning appeared for every spectrum in the MGF file (`test.mgf`), suggesting
the entire file uses a TITLE format that the parser cannot handle.

## Where It Was Observed

- **Run:** Baseline benchmark (`baseline/MSGFPlus.jar`, v2026.03.25)
- **Input:** `test.mgf` — MGF file with TITLE lines in PRIDE/ProteomeXchange format
- **Database:** `human-uniprot-contaminants.revCat.fasta`

## Relevant Code

### `MgfSpectrumParser.extractScanRangeFromTitle()` — the parser

```
src/main/java/edu/ucsd/msjava/parser/MgfSpectrumParser.java:278-316
```

The method splits the title on `.` and expects:
- `token.length > 3` → `DatasetName.ScanStart.ScanEnd.Charge`
- `token.length == 3 && title.endsWith(".")` → `DatasetName.ScanStart.ScanEnd.`

The PRIDE-format title `id=PXD002047;TCGA-AA-A02O-01A-23_W_VU_20130205_A0218_10A_R_FR05.mzML;controllerType=0`
splits to `["id=PXD002047;TCGA-AA-A02O-01A-23_W_VU_20130205_A0218_10A_R_FR05", "mzML;controllerType=0"]`
(only 2 tokens), so it falls through to the `else` branch and emits the warning.
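The token counts above can be reproduced with a small standalone check. This sketch assumes the parser splits with `String.split("\\.")`; the class and method names are hypothetical and are not the actual `MgfSpectrumParser` code.

```java
final class TitleSplitDemo {
    /** Number of tokens produced by splitting a TITLE on '.'. */
    static int tokenCount(String title) {
        // String.split drops trailing empty tokens, so "A.1.1." yields 3 tokens.
        return title.split("\\.").length;
    }

    /** Mirrors the two accepted shapes described above (hypothetical helper). */
    static boolean matchesExpectedFormat(String title) {
        int n = tokenCount(title);
        return n > 3 || (n == 3 && title.endsWith("."));
    }

    public static void main(String[] args) {
        String pride = "id=PXD002047;TCGA-AA-A02O-01A-23_W_VU_20130205_A0218_10A_R_FR05.mzML;controllerType=0";
        System.out.println(tokenCount(pride));
        System.out.println(matchesExpectedFormat(pride));
        System.out.println(matchesExpectedFormat("Dataset.100.100.2"));
    }
}
```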

### `MgfSpectrumParser.warnScanNotFoundInTitle()` — the warning

```
src/main/java/edu/ucsd/msjava/parser/MgfSpectrumParser.java:384-392
```

Capped at `MAX_SCAN_MISSING_WARNINGS` prints, then silently counts the rest.
Final total printed by `SpecKey.java:139`.
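The capped-warning behaviour can be pictured roughly as follows. This is an illustrative sketch only: the class, its constructor parameter, and the method names are assumptions, not the code at the line ranges cited above.

```java
final class CappedWarning {
    private final int maxPrinted; // stands in for MAX_SCAN_MISSING_WARNINGS
    private int seen = 0;

    CappedWarning(int maxPrinted) {
        this.maxPrinted = maxPrinted;
    }

    /** Returns true if the warning was printed, false if it was only counted. */
    boolean warn(String title) {
        seen++;
        if (seen <= maxPrinted) {
            System.err.println("Unable to extract the scan number from the title: " + title);
            return true;
        }
        return false;
    }

    /** Total number of titles that failed, printed or not. */
    int total() {
        return seen;
    }
}
```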

## Hypotheses

1. **Title format mismatch (most likely):** The MGF file uses a PRIDE/ProteomeXchange
`TITLE` format that encodes the source file reference and controller info with
semicolons, not the `Dataset.Start.End.Charge` convention. The parser has no
fallback for alternative formats.

2. **Possible alternative scan encodings in TITLE:** Some MGF generators embed scan
numbers as `scan=NNNN` or `scans=NNNN` within the TITLE string. The parser
doesn't attempt to extract these.

3. **`index=` fallback:** When scan extraction fails, the spectrum gets assigned
`index=N` as its ID (from `specIndexMap`). This means the mzIdentML output
will reference spectra by index rather than native scan number, which may
affect downstream tools that expect scan-based references.

## Impact

- **Search results:** Not affected — MS-GF+ still searches the spectra correctly.
- **Output traceability:** Degraded — mzIdentML references use index instead of
native scan IDs, making it harder to trace PSMs back to the raw data.
- **Benchmark:** May cause metric discrepancies if downstream scripts parse scan
numbers from the mzIdentML output.

## Potential Fixes

1. Add regex-based fallback in `extractScanRangeFromTitle()` to detect patterns like:
   - `scan=(\d+)` or `scans=(\d+)`
   - `spectrum=(\d+)`
   - `index=(\d+)`
2. Support PRIDE USI-style TITLE parsing: extract scan from
`controllerType=0 controllerNumber=1 scan=NNNN` if present.
3. Allow users to specify a scan number extraction regex via CLI parameter.
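As a rough illustration of fix 1, a regex fallback for `scan=` / `scans=` could look like the sketch below. The class name, method name, and `OptionalInt` return type are hypothetical, and how such a fallback would be wired into `extractScanRangeFromTitle()` is left open.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScanTitleFallback {
    // Matches "scan=123" or "scans=123"; \b avoids matching inside e.g. "prescan=".
    private static final Pattern SCAN = Pattern.compile("\\bscans?=(\\d+)");

    /** Returns the first scan number found in the title, or empty if none. */
    static OptionalInt extractScan(String title) {
        Matcher m = SCAN.matcher(title);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        } catch (NumberFormatException e) {
            // Digit run too long for an int: treat as "not found".
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(extractScan("controllerType=0 controllerNumber=1 scan=1234"));
        System.out.println(extractScan("id=PXD002047;file.mzML;controllerType=0"));
    }
}
```

Returning an empty result instead of throwing lets the caller keep the existing warning path for titles that match neither the dotted format nor the fallback.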

## Next Steps

- [ ] Examine the actual MGF file to see the full TITLE line format
- [ ] Check if `scan=` or similar key-value pairs are embedded in the TITLE
- [ ] Review how other tools (MaxQuant, Comet, X!Tandem) handle non-standard TITLE formats
- [ ] Decide on backward-compatible fix approach
- [ ] Add unit test covering PRIDE-format TITLE strings
@@ -0,0 +1,205 @@
# Investigation 002: E-value Leaks Target/Decoy Information to Percolator

**Status:** OPEN
**Date reported:** 2026-04-15
**Severity:** HIGH — affects FDR estimation for all downstream rescoring tools
**Source:** EuBIC-MS Symposium 04/2026, Copenhagen — Henry Emanuel Weber, Ruhr-Universität Bochum (Jun.-Prof. Julien Urchueguía group)
**Slide screenshot:** `assets/Screenshot_2026-04-15_at_13.23.09-*.png`

## What Was Observed

When MS-GF+ results are passed to rescoring tools (Percolator, MS2Rescore, Oktoberfest),
the target and decoy score distributions become **completely separated** — 100% separation.
This does NOT happen with Comet results on the same data.

The presenter found that **removing the E-value (MS:1002053) from the MS-GF+ features
fixed the problem**, confirming that the E-value is the source of information leakage.

Key observations from the slide:
- **Comet + TDA/Percolator/MS2Rescore/Oktoberfest:** Normal overlapping distributions
- **MS-GF+ + TDA:** Normal overlapping distributions (E-value not used as feature)
- **MS-GF+ + Percolator/MS2Rescore/Oktoberfest:** Perfect separation (E-value used as feature)

## The Mechanism

### How MS-GF+ computes the E-value

The E-value is computed as:

```
E-value = SpecEValue × numDistinctPeptides
```

See `MZIdentMLGen.java:347`:
```java
double eValue = specEValue * numPeptides;
```

Where:
- **SpecEValue** (`MS:1002052`) = spectral-level E-value from the generating function
(computed per spectrum, independent of target/decoy status)
- **numDistinctPeptides** = count of distinct peptide sequences of the matched length
in the **entire** concatenated target-decoy database
(from `CompactSuffixArray.getNumDistinctPeptides()`)
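A small worked example of the formula, using made-up per-length counts (the numbers and the `EValueDemo` class are illustrative assumptions, not values from a real database). It shows that the same SpecEValue maps to different E-values when the matched peptides differ in length.

```java
final class EValueDemo {
    /** E-value = SpecEValue x numDistinctPeptides[length], per the formula above. */
    static double eValue(double specEValue, int[] numDistinctPeptides, int peptideLength) {
        return specEValue * numDistinctPeptides[peptideLength];
    }

    public static void main(String[] args) {
        // Hypothetical distinct-peptide counts indexed by peptide length.
        int[] counts = new int[12];
        counts[8] = 1_000_000;
        counts[11] = 2_500_000;

        double specEValue = 1e-10;
        System.out.println(eValue(specEValue, counts, 8));
        System.out.println(eValue(specEValue, counts, 11));
    }
}
```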

### Why it leaks

The `numDistinctPeptides` multiplier is derived from the suffix array built over the
**concatenated target+decoy database** (`-tda 1` mode). The count includes both target
and decoy peptides.

However, the critical issue is that `numDistinctPeptides` is looked up by **peptide
length** (see `CompactSuffixArray.java:138-140`):

```java
public int getNumDistinctPeptides(int length) {
    return numDistinctPeptides[length];
}
```

This is the same multiplier for targets and decoys of the same length, so the
E-value itself doesn't directly encode target/decoy status. The leakage likely
comes from a subtler mechanism:

**Hypothesis 1: Database-size asymmetry**
When `-tda 1` is used, MS-GF+ generates reversed decoys internally. The number
of distinct peptides at each length may differ slightly between the target and
decoy halves. Since the E-value uses the combined count, it implicitly encodes
information about the database composition. Percolator, being a machine learning
model, can learn to exploit even tiny systematic differences.

**Hypothesis 2: Score distribution coupling**
The generating function that produces SpecEValue is computed using score
distributions that are calibrated on the full database. If the score distribution
shape differs systematically between target and decoy hits (which it does — true
matches exist only for targets), the SpecEValue already carries some target/decoy
signal that gets amplified by the numPeptides multiplier.

**Hypothesis 3: Q-value propagation**
The Q-value (`MS:1002054`) is explicitly computed from TDA and directly encodes
target/decoy ranking. If Q-value is also passed to Percolator alongside E-value,
the combined features create a perfect classifier. However, the presenter
specifically identified E-value (not Q-value) as the problematic score.

**Hypothesis 4: E-value scale differences**
SpecEValue is a per-spectrum probability; E-value is SpecEValue × database_size.
Since all peptides (target and decoy) use the same `numDistinctPeptides[length]`,
the E-value is a monotonic transform of SpecEValue for peptides of the same
length. But across different lengths, the scaling differs, and Percolator could
learn length-dependent patterns that correlate with target/decoy status.

## Relevant Code

### E-value computation

- `MZIdentMLGen.java:345-347` — `eValue = specEValue * numPeptides`
- `DirectTSVWriter.java:138-141` — same computation for TSV output
- `DBScanner.java:853-854` — same computation for MSGFDB output
- `MSGFDBResultGenerator.java:92-104` — `getPValue()` and `getEValue()` static methods

### numDistinctPeptides lookup

- `CompactSuffixArray.java:138-140` — `getNumDistinctPeptides(length)`
- `CompactSuffixArray.java:196-228` — counting logic over suffix array
- `SuffixArrayForMSGFDB.java:43-46` — wrapper

### Scores written to mzIdentML

- `MS:1002049` — RawScore (integer, safe)
- `MS:1002050` — DeNovoScore (integer, safe)
- `MS:1002052` — SpecEValue (spectral E-value, probably safe)
- `MS:1002053` — EValue (database E-value, **LEAKS**)
- `MS:1002054` — QValue (from TDA, **inherently encodes T/D**)

## Impact

- **All rescoring workflows are affected:** Any tool that uses MS-GF+ E-value as a
feature (Percolator, MS2Rescore, Oktoberfest) will produce artificially inflated
identification rates
- **Published results may be affected:** Studies using MS-GF+ → Percolator pipelines
may report overly optimistic PSM counts
- **FDR estimates are unreliable:** The 100% target/decoy separation means FDR
cannot be meaningfully estimated

## Which Scores Leak?

### Safe scores (no target/decoy information)
| CV Accession | Name | Why safe |
|-------------|-------------|----------|
| MS:1002049 | RawScore | Integer score from generating function, per-spectrum |
| MS:1002050 | DeNovoScore | Integer de novo score, per-spectrum |
| MS:1002052 | SpecEValue | Spectral E-value from generating function, per-spectrum. No TDA dependency. |

### Unsafe scores (leak target/decoy information)
| CV Accession | Name | Why it leaks |
|-------------|------------|--------------|
| MS:1002053 | EValue | `SpecEValue × numDistinctPeptides` — database-size multiplier may introduce asymmetry. Confirmed as the leak source by the presenter. |
| MS:1002054 | QValue | **Directly computed from TDA** via `TargetDecoyAnalysis.getPSMQValue()` — it IS the target/decoy separation. Passing this to Percolator is giving it the answer key. |
| MS:1002055 | PepQValue | Same as QValue but at peptide level. Also directly from TDA. |

### Q-value is categorically worse than E-value

The Q-value (`MS:1002054`) is computed by `TargetDecoyAnalysis.getFDRMap()` which:
1. Separates PSMs into target and decoy lists (by protein prefix, e.g. `XXX_`)
2. Sorts both by score
3. Walks down the ranked list computing `FDR = decoyCount / targetCount`
4. Converts FDRs to Q-values (monotonic minimum)
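The steps above can be sketched as follows. This is a simplified illustration, not the `TargetDecoyAnalysis` implementation: it assumes a single list of PSM labels already sorted best score first, and the class and method names are hypothetical.

```java
import java.util.Arrays;

final class QValueDemo {
    /**
     * @param isDecoy target/decoy labels, sorted best score first
     * @return q-value for each PSM, in the same order
     */
    static double[] qValues(boolean[] isDecoy) {
        int n = isDecoy.length;
        double[] fdr = new double[n];
        int targets = 0;
        int decoys = 0;
        // Walk down the ranked list: FDR = decoyCount / targetCount at each cutoff.
        for (int i = 0; i < n; i++) {
            if (isDecoy[i]) decoys++; else targets++;
            fdr[i] = targets == 0 ? 1.0 : Math.min(1.0, (double) decoys / targets);
        }
        // q-value = minimum FDR at this cutoff or any more permissive one.
        double[] q = new double[n];
        double running = 1.0;
        for (int i = n - 1; i >= 0; i--) {
            running = Math.min(running, fdr[i]);
            q[i] = running;
        }
        return q;
    }

    public static void main(String[] args) {
        boolean[] labels = {false, false, true, false};
        System.out.println(Arrays.toString(qValues(labels)));
    }
}
```

Because the labels are the only input, the sketch makes the point of this section concrete: the q-value is a function of target/decoy status and rank, nothing else.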

This is a **direct encoding** of target vs decoy status. If Percolator receives
QValue as a feature, it can trivially reconstruct whether a PSM is target or
decoy — far more directly than the E-value leakage. The EValue leakage is subtle
(the presenter had to investigate to find it); QValue leakage is by definition.

In practice, most rescoring tools (Percolator, MS2Rescore) likely skip QValue
because it's already an FDR estimate. But EValue looks like a "normal" search
engine score and gets picked up as a feature — which is why the EValue leak
is the one that actually manifests.

## Proposed Fix: Only Output SpecEValue (Omit EValue and QValue)

Since the downstream workflow is always `MS-GF+ → Percolator/rescoring tool → FDR`,
MS-GF+ does not need to output its own EValue or QValue. The rescoring tool will
compute its own FDR.

### What to change
1. **Stop writing EValue (MS:1002053) to mzIdentML** — or make it optional via CLI flag
2. **Stop writing QValue (MS:1002054) and PepQValue (MS:1002055)** — same treatment
3. **Keep SpecEValue (MS:1002052)** — this is the per-spectrum score, safe for rescoring
4. **Keep RawScore (MS:1002049) and DeNovoScore (MS:1002050)** — integer scores, safe

### Where to change
- `MZIdentMLGen.java:346-421` — mzIdentML output (remove/gate EValue, QValue, PepQValue CV params)
- `DirectTSVWriter.java:140-208` — TSV output (same)
- `DBScanner.java:853` — MSGFDB TSV output (same)
- `MSGFPlus.java` / `MSGFDB.java` — add CLI flag (e.g. `--no-evalue` or `--percolator-safe`)

### Impact on MSGFPlusAdapter (OpenMS)
The OpenMS `MSGFPlusAdapter` extracts scores from MS-GF+ mzIdentML output. If we
stop outputting EValue by default, the adapter needs to be updated to use SpecEValue
instead. This should be coordinated with the OpenMS team, or we add a CLI flag
so existing workflows keep working.

### Backward compatibility
- Add a flag like `-rescoring 1` that omits EValue/QValue from output
- Default behavior unchanged (EValue/QValue still written) for backward compat
- Document clearly that `-rescoring 1` should be used when piping to Percolator
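One way to picture the proposed gating is a column filter driven by a boolean. The sketch below is hypothetical throughout: `-rescoring` is only a proposal in this note, and the class, the column list, and the method are illustrative rather than existing writer code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class ScoreColumnFilter {
    // Score names as listed in this note.
    static final List<String> ALL =
            List.of("RawScore", "DeNovoScore", "SpecEValue", "EValue", "QValue", "PepQValue");

    // Scores this note classifies as unsafe for rescoring.
    static final Set<String> UNSAFE = Set.of("EValue", "QValue", "PepQValue");

    /** Columns to emit; with rescoring=false the full legacy set is kept. */
    static List<String> columns(boolean rescoring) {
        List<String> out = new ArrayList<>();
        for (String c : ALL) {
            if (rescoring && UNSAFE.contains(c)) {
                continue; // omit leaky scores when output feeds a rescoring tool
            }
            out.add(c);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(columns(false));
        System.out.println(columns(true));
    }
}
```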

## Next Steps

- [ ] Reproduce the issue: run MS-GF+ on a benchmark dataset, feed to Percolator,
plot target/decoy distributions with and without E-value
- [ ] Contact Henry Emanuel Weber / Julien Urchueguía group for their test dataset
and exact Percolator configuration
- [ ] Analyze whether SpecEValue alone also leaks (likely not, but should verify)
- [ ] Check if the leakage magnitude depends on database size (small DB = more leakage?)
- [ ] Review what scores MS2Rescore/Percolator extract from MS-GF+ mzIdentML by default
- [ ] Implement `-rescoring 1` CLI flag to omit EValue/QValue/PepQValue from output
- [ ] Coordinate with OpenMS team on MSGFPlusAdapter changes (use SpecEValue instead of EValue)
- [x] Add skill documentation (see `.claude/skills/score-output-safety.md`)

## References

- Slide: "Target and decoy distributions" — EuBIC-MS Symposium 04/2026, Copenhagen
- Presenter: Henry Emanuel Weber, Medical Bioinformatics, Ruhr-Universität Bochum
- Group: Jun.-Prof. Julien Urchueguía
- Talk: "Leveling the playing field" (slide 9)
10 changes: 10 additions & 0 deletions .claude/investigations/README.md
@@ -0,0 +1,10 @@
# Investigations

Tracked issues, bugs, and behaviors that need further analysis.

Each investigation should document:
1. **What was observed** — error messages, unexpected behavior
2. **Where it was observed** — which run, dataset, configuration
3. **Relevant code** — source files and line numbers
4. **Hypotheses** — potential root causes
5. **Status** — open / in-progress / resolved
14 changes: 14 additions & 0 deletions .claude/plans/README.md
@@ -0,0 +1,14 @@
# Plans

Implementation plans and design documents for MS-GF+ features and improvements.

Each plan is a separate markdown file named descriptively, e.g.:
- `streaming-mzml-parser.md`
- `mgf-scan-number-parsing.md`

## Archived / superseded

- `~/.claude/plans/msgfplus-primitives-optimization/plan.md` — shipped in PRs #15-#20 + PR #22 (P2-cal). Historical reference.
- `~/.claude/plans/msgfplus-fragment-index/` — **abandoned 2026-04-20** after failing speed/recall/memory gates. See `ABANDONED-2026-04-20.md` for the post-mortem. Alternative speed ideas (graph-skeleton caching, adaptive tolerance, parallelism ceiling) are documented there.

Detailed plans live under `~/.claude/plans/` (outside the repo) to avoid checking planning artifacts into git.