Dev#31

Merged
ypriverol merged 8 commits into master from dev on May 23, 2026

@ypriverol ypriverol commented May 23, 2026

Summary by CodeRabbit

Release Notes

  • New Features

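    • CLI parameters now use Rust-style named values (e.g., --fragmentation HCD instead of --fragmentation 3) and descriptive flag names (e.g., --enzyme-specificity instead of --ntt), with backward-compatible aliases for existing scripts that use the legacy flags and numeric forms.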
  • Documentation

    • Comprehensive DOCS.md covering CLI parameters, search configuration, mzML auto-detection, and mods file format.
    • CLI_MIGRATION.md providing migration guidance from Java MS-GF+ with worked examples.
    • Updated README with Rust installation and usage instructions.
    • Removed obsolete Java MS-GF+ documentation.
  • Tests

    • Added CLI regression test validating byte-identical search results across both parameter syntaxes.


ypriverol and others added 8 commits May 23, 2026 12:33
…r state

Design document for iter39:
- Rewrite README.md as a linear narrative serving both quantms operators
  and mass-spec researchers (~190 lines).
- New single-file DOCS.md reference at repo root (~505 lines).
- New CLI_MIGRATION.md with Java → Rust flag mapping + numeric-legacy
  → named-value table + worked examples (~100 lines).
- CLI rename: numeric enum IDs → named values (--fragmentation HCD vs
  --fragmentation 3); --ntt → --enzyme-specificity; --mod → --mods.
  All legacy forms still accepted silently for quantms script compat.
- Delete the entire user-facing docs/ tree.

The Rust port now beats Java MS-GF+ on all 3 benchmark datasets; this
iteration treats msgf-rust as a new app and writes its docs from scratch
to fit.

Acronym style: HCD/CID/ETD/UVPD/TMT/iTRAQ/TOF uppercase, QExactive in
brand casing, descriptive values (auto, low-res, fully, etc.) in
lowercase kebab-case. clap parses case-insensitively so quantms scripts
that lowercase values still work.

ScoringParamGen porting is acknowledged as roadmap work, not in this
iteration.
Implementation plan for iter39 — docs rewrite + CLI rename.

Plan structure: 5 sequential commits on iter39-docs-rewrite, decomposed
into 8 tasks of bite-sized TDD steps.

- Tasks 1-3 produce Commit 1: CLI rename + enums + custom parsers +
  resolver signature change + 15 updated unit tests + 1 new round-trip
  integration test.
- Task 4: rewrite README.md (full content embedded).
- Task 5: add DOCS.md (skeleton + per-section content guides; the
  prose-heavy sections defer to the spec and source code for content).
- Task 6: add CLI_MIGRATION.md (full content embedded — Table A
  Java→Rust, Table B legacy-numeric→named, three worked examples).
- Task 7: delete the legacy docs/ tree (36+ tracked files);
  engineering planning subdirectories preserved.
- Task 8: push branch + open PR.

Each step is one action (2-5 min). Commits land in dependency order.
The new round-trip test (cli_smoke.rs) guards the back-compat path
by asserting --fragmentation 3 and --fragmentation HCD produce
byte-identical PIN output.

Constraint observed: no commit message in this plan contains the word
that triggers the no-claude-attribution hook.
Replace numeric Java-historical enum flags with Rust-idiomatic named
values and rename --mod → --mods, --ntt → --enzyme-specificity. All
legacy forms still accepted silently for quantms script compat.

Canonical (shown in --help):
- --fragmentation auto|CID|ETD|HCD|UVPD     (default: auto)
- --instrument low-res|high-res|TOF|QExactive (default: low-res)
- --protocol auto|phospho|iTRAQ|iTRAQ-phospho|TMT|standard (default: auto)
- --enzyme-specificity non-specific|semi|fully (default: fully)
- --mods <FILE>   (singular --mod kept as hidden alias)

Legacy (silently accepted):
- --fragmentation 0..=4
- --instrument 0..=3
- --protocol 0..=5
- --ntt 0..=2          (--ntt is also a clap alias of --enzyme-specificity)
- --mod <FILE>

clap parses values case-insensitively, so quantms scripts that lowercase
named values (--fragmentation hcd) keep working.
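
The dual acceptance described above (canonical names matched case-insensitively, with legacy numeric IDs as a fallback) can be sketched with the standard library alone. This is an illustration, not the code in msgf-rust.rs, which wires its parsers through clap. The error strings are invented, and only the mapping 3 => HCD is stated in this PR; the other numeric IDs are assumed to follow the order of the canonical list.

```rust
/// Fragmentation method, mirroring the canonical values shown in `--help`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Fragmentation {
    Auto,
    Cid,
    Etd,
    Hcd,
    Uvpd,
}

/// Parse a `--fragmentation` value: canonical name first (case-insensitive),
/// then the legacy numeric ID as a fallback.
pub fn parse_fragmentation(raw: &str) -> Result<Fragmentation, String> {
    // Step 1: canonical named value, compared case-insensitively.
    match raw.to_ascii_lowercase().as_str() {
        "auto" => return Ok(Fragmentation::Auto),
        "cid" => return Ok(Fragmentation::Cid),
        "etd" => return Ok(Fragmentation::Etd),
        "hcd" => return Ok(Fragmentation::Hcd),
        "uvpd" => return Ok(Fragmentation::Uvpd),
        _ => {}
    }
    // Step 2: legacy numeric ID in 0..=4. Only 3 => HCD is confirmed by the
    // PR text; the remaining IDs are ASSUMED to follow the canonical order.
    match raw.parse::<u8>() {
        Ok(0) => Ok(Fragmentation::Auto),
        Ok(1) => Ok(Fragmentation::Cid),
        Ok(2) => Ok(Fragmentation::Etd),
        Ok(3) => Ok(Fragmentation::Hcd),
        Ok(4) => Ok(Fragmentation::Uvpd),
        // Hypothetical error wording; the real messages may differ.
        Ok(n) => Err(format!("fragmentation ID {} is out of range (expected 0..=4)", n)),
        Err(_) => Err(format!("unknown fragmentation value '{}'", raw)),
    }
}
```

With this shape, `parse_fragmentation("HCD")`, `parse_fragmentation("hcd")`, and `parse_fragmentation("3")` all yield `Fragmentation::Hcd`, while `"5"` is rejected at parse time rather than inside the resolver.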

Internal:
- Added four ValueEnum-derived enums: Fragmentation, Instrument,
  Protocol, EnzymeSpecificity.
- Added four custom value parsers: parse_fragmentation,
  parse_instrument, parse_protocol, parse_enzyme_specificity. Each tries
  the canonical named value first, falls back to the legacy numeric ID.
- Changed resolve_bundled_param and resolve_bundled_param_for_activation
  signatures from Option<u8> triples to strongly-typed enums. The
  "all-defaults short-circuit" (which produced HCD_QExactive_Tryp.param
  pre-iter39 when no flags were given) is preserved via the
  Fragmentation::Auto + Instrument::LowRes + Protocol::Auto check.
- Updated the 15 param_resolver_tests for the new signature; replaced
  the three "rejects out of range" resolver tests with equivalent tests
  on the parser functions (clap rejects bad values at parse time now).
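
A rough sketch of how a resolver over typed enums might look, assuming the behavior described in this PR: the all-defaults short-circuit and the HCD + low-res upgrade to QExactive. Only the file name HCD_QExactive_Tryp.param appears in the PR text. The other file-name parts, the protocol suffixes, the treatment of a non-default `Auto`, and the `String` return type are hypothetical placeholders, not the real resolve_bundled_param.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Fragmentation { Auto, Cid, Etd, Hcd, Uvpd }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Instrument { LowRes, HighRes, Tof, QExactive }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Protocol { Auto, Phospho, Itraq, ItraqPhospho, Tmt, Standard }

/// Pick a bundled .param file name from typed CLI values (illustrative only).
pub fn resolve_bundled_param(frag: Fragmentation, inst: Instrument, proto: Protocol) -> String {
    // All-defaults short-circuit: no flags given keeps the pre-iter39 result.
    if frag == Fragmentation::Auto && inst == Instrument::LowRes && proto == Protocol::Auto {
        return "HCD_QExactive_Tryp.param".to_string();
    }
    // HCD on the default low-res instrument is upgraded to QExactive.
    let inst = if frag == Fragmentation::Hcd && inst == Instrument::LowRes {
        Instrument::QExactive
    } else {
        inst
    };
    // ASSUMPTION: a non-default `Auto` falls back to CID in this sketch.
    let frag_name = match frag {
        Fragmentation::Auto | Fragmentation::Cid => "CID",
        Fragmentation::Etd => "ETD",
        Fragmentation::Hcd => "HCD",
        Fragmentation::Uvpd => "UVPD",
    };
    let inst_name = match inst {
        Instrument::LowRes => "LowRes",
        Instrument::HighRes => "HighRes",
        Instrument::Tof => "TOF",
        Instrument::QExactive => "QExactive",
    };
    // HYPOTHETICAL suffixes; the real protocol-to-suffix mapping may differ.
    let suffix = match proto {
        Protocol::Auto | Protocol::Standard => "",
        Protocol::Phospho => "_Phospho",
        Protocol::Itraq => "_iTRAQ",
        Protocol::ItraqPhospho => "_iTRAQPhospho",
        Protocol::Tmt => "_TMT",
    };
    format!("{}_{}_Tryp{}.param", frag_name, inst_name, suffix)
}
```

The point of the typed signature is that out-of-range values can no longer reach this function: every argument is already a valid enum variant by the time the resolver runs.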

Verified:
- cargo test --release -p msgf-rust → 18 passed (15 resolver tests
  + 3 new parser-out-of-range tests).
- cargo test --release -p msgf-rust --test cli_smoke → 8 passed
  (7 existing + 1 new round-trip).
- cargo test --release --workspace → no new failures vs baseline.

New regression guard: cli_accepts_both_named_and_numeric_param_values
runs a small search twice (once with --fragmentation 3 --protocol 4,
once with --fragmentation HCD --protocol TMT) and asserts PIN outputs
are byte-identical.
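
The comparison step of such a regression guard could look like the sketch below. It assumes the two PIN outputs have already been read into strings and, following the review walkthrough's description of this test, compares the header line plus the sorted data rows. The helper names are hypothetical; a strict byte-for-byte check would reduce to `a == b`.

```rust
/// Split PIN text into its header line and the remaining non-empty rows,
/// sorted so that row order does not affect the comparison.
fn header_and_sorted_rows(pin: &str) -> (Option<&str>, Vec<&str>) {
    let mut lines = pin.lines();
    let header = lines.next();
    let mut rows: Vec<&str> = lines.filter(|l| !l.trim().is_empty()).collect();
    rows.sort_unstable();
    (header, rows)
}

/// True when two PIN outputs share the same header and the same set of rows
/// (including duplicates), regardless of row order.
pub fn pin_outputs_match(a: &str, b: &str) -> bool {
    header_and_sorted_rows(a) == header_and_sorted_rows(b)
}
```

In a test, the two inputs would come from one run with `--fragmentation 3 --protocol 4` and one with `--fragmentation HCD --protocol TMT`, followed by `assert!(pin_outputs_match(&legacy, &named))`.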

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the legacy Java-tool README (193 lines, Java 17 + JAR + mvn) with
a linear-narrative README for the Rust port (~190 lines, dual audience).

Sections, top to bottom:
1. Title + tagline + badges (CI, release, license)
2. What is this? — one paragraph, names UCSD original
3. Why msgf-rust? — benchmark table vs Java on Astral / PXD001819 / TMT
4. Install — release archive, cargo install, build from source
5. Quick Start — minimal command, one paragraph on .pin row shape
6. Common workflows — tryptic DDA, TMT, TSV output, quantms integration
7. CLI summary — table of ~17 most-used flags
8. Auto-detection — activation/instrument detection from mzML
9. Parity vs Java MS-GF+ — short summary; pointer to DOCS.md §8d
10. Citation
11. License — UCSD-Noncommercial; pointer to java-legacy and
    java-legacy-original branches
12. Acknowledgments

quantms operators have a labeled section in #6 + the CLI summary in #7.
Researchers see the benchmark proof up front in #3.

The full CLI reference, mods.txt grammar, PIN/TSV column docs, training
notes, and Java→Rust migration table live in DOCS.md (separate commit).
The Java→Rust flag mapping table lives in CLI_MIGRATION.md (separate
commit).

Co-authored-by: Cursor <cursoragent@cursor.com>
Add DOCS.md at repo root: the full power-user reference covering all
flags, formats, build/test workflow, training notes, and Java→Rust
migration. ~505 lines, navigated via a top-of-file table of contents.

Sections:
1. CLI reference — every flag with type/default/description and
   accepted legacy form
2. Mods.txt format — grammar + 3 worked examples
3. Output formats — PIN columns, TSV columns, when to use which
4. Auto-detection — activation method detection from mzML +
   param-file resolution table
5. Building from source — Rust 1.85+, cargo build/test, the 7 CI-skipped
   tests and reasons
6. Training new .param files — current state (reuse Java's bundled
   files), roadmap (port ScoringParamGen), interim workflow
   (train on java-legacy, --param-file at the Rust binary)
7. Isobaric labeling — TMT and iTRAQ workflows, required mods entries,
   auto-selected param file
8. Java MS-GF+ → msgf-rust migration — flag rename table, behavior
   differences, known parity divergences
9. License and citation

The DOCS.md design follows the linear-narrative pattern of README.md:
no nested directories, no site generator, just one Cmd-F-friendly file.

Co-authored-by: Cursor <cursoragent@cursor.com>
One-page reference for porting Java MS-GF+ command lines or quantms
scripts to msgf-rust. Covers:

- Table A: Java flag → msgf-rust flag mapping (18 flags).
- Table B: numeric-legacy → canonical named value mapping (one row per
  legacy ID across fragmentation, instrument, protocol, enzyme-specificity).
- Three worked examples (plain tryptic DDA; TMT 10-plex; phospho STY)
  showing the Java MS-GF+ command line and the msgf-rust equivalent
  side-by-side.
- Notes on behaviors that simply don't exist on the Rust side (no
  -tda flag, no -e enzyme flag, no mzXML/PKL/MS2 input, no mzIdentML
  output).

msgf-rust silently accepts the legacy forms (--fragmentation 3,
--mod, --ntt) for backward compatibility with quantms scripts. New
canonical forms are documented for fresh users.
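
For scripts that should move to the canonical forms, the renames and the numeric-to-named table can be applied mechanically. The helper below is a hypothetical illustration, not part of msgf-rust. It assumes space-separated `--flag value` pairs and that numeric IDs follow the order of the canonical value lists; only 3 => HCD and 4 => TMT are stated explicitly in this PR.

```rust
/// Rewrite legacy flag spellings and numeric enum IDs to canonical forms.
/// Unknown flags and already-named values pass through unchanged.
pub fn canonicalize_args(args: &[&str]) -> Vec<String> {
    let mut out = Vec::with_capacity(args.len());
    let mut iter = args.iter();
    while let Some(&arg) = iter.next() {
        // Flag renames stated in the PR.
        let flag = match arg {
            "--mod" => "--mods",
            "--ntt" => "--enzyme-specificity",
            other => other,
        };
        out.push(flag.to_string());
        // Canonical value lists, indexed by the ASSUMED legacy numeric ID.
        let names: &[&str] = match flag {
            "--fragmentation" => &["auto", "CID", "ETD", "HCD", "UVPD"],
            "--instrument" => &["low-res", "high-res", "TOF", "QExactive"],
            "--protocol" => &["auto", "phospho", "iTRAQ", "iTRAQ-phospho", "TMT", "standard"],
            "--enzyme-specificity" => &["non-specific", "semi", "fully"],
            _ => continue, // not an enum-valued flag: nothing else to rewrite
        };
        if let Some(&value) = iter.next() {
            // A numeric value inside the table is replaced by its name;
            // anything else (named or out-of-range) is kept as written.
            let canonical = value
                .parse::<usize>()
                .ok()
                .and_then(|i| names.get(i))
                .copied()
                .unwrap_or(value);
            out.push(canonical.to_string());
        }
    }
    out
}
```

Because msgf-rust still accepts the legacy forms, a rewrite like this is optional; it mainly helps keep new scripts consistent with what `--help` shows. The sketch does not handle the `--flag=value` spelling.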

Co-authored-by: Cursor <cursoragent@cursor.com>
The docs/ tree predated the Rust cutover and described the Java tool
(mvn build, JAR distribution, Java CLI). Content that still applies has
been migrated to root-level README.md, DOCS.md, and CLI_MIGRATION.md.

Deleted (38 tracked files):
- docs/msgfplus.md (full Java CLI reference — superseded by DOCS.md §1)
- docs/msgfdb_modfile.md (mods.txt grammar — superseded by DOCS.md §2)
- docs/output.md (PIN/TSV columns — superseded by DOCS.md §3)
- docs/buildsa.md (Java standalone SA builder — Java-only utility)
- docs/training-scoring-models.md (Java trainer — superseded by DOCS.md §6)
- docs/isobariclabeling.md (TMT/iTRAQ — superseded by DOCS.md §7)
- docs/troubleshooting.md (Java JVM tuning — Java-only)
- docs/changelog.md (Java release notes — GitHub Releases tracks v0.1.0+)
- docs/readme.md (Java tool overview — superseded by root README.md)
- docs/benchmarks/ (3 PNG figures from Java perf comparison — stale)
- docs/examples/ (Mods.txt + activation/enzyme/protocol samples —
  inline examples in DOCS.md instead)
- docs/parameterfiles/ (15 Java -conf templates — no Rust equivalent)

Preserved:
- docs/superpowers/specs/ — design specs (engineering planning).
- docs/superpowers/plans/ — implementation plans (engineering planning).
- docs/parity-analysis/ (already gitignored since commit 5e9b63a;
  no action needed).

Co-authored-by: Cursor <cursoragent@cursor.com>
iter39: docs + CLI rename for the post-cutover state
@ypriverol ypriverol merged commit 18360a3 into master May 23, 2026
8 of 9 checks passed

coderabbitai Bot commented May 23, 2026

Caution: Review failed. The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 82a284c1-e1ec-43d2-8c16-fd8f86778330

📥 Commits

Reviewing files that changed from the base of the PR and between c863dae and 0b137bc.

⛔ Files ignored due to path filters (5)
  • docs/benchmarks/fig1_baseline_vs_currentdev.png is excluded by !**/*.png
  • docs/benchmarks/fig2_currentdev_vs_sage.png is excluded by !**/*.png
  • docs/benchmarks/fig3_psms_and_peptides.png is excluded by !**/*.png
  • docs/examples/test.tsv is excluded by !**/*.tsv
  • docs/examples/test_Unrolled.tsv is excluded by !**/*.tsv
📒 Files selected for processing (40)
  • CLI_MIGRATION.md
  • DOCS.md
  • README.md
  • crates/msgf-rust/src/bin/msgf-rust.rs
  • crates/msgf-rust/tests/cli_smoke.rs
  • docs/buildsa.md
  • docs/changelog.md
  • docs/examples/MSGFPlus_Params.txt
  • docs/examples/Mods.txt
  • docs/examples/activationMethods.txt
  • docs/examples/enzymes.txt
  • docs/examples/protocols.txt
  • docs/examples/pxd001819_example.pin
  • docs/examples/readme.md
  • docs/isobariclabeling.md
  • docs/msgfdb_modfile.md
  • docs/msgfplus.md
  • docs/output.md
  • docs/parameterfiles/MSGFPlus_PartTryp_DynMetOx_ProOx_Stat_CysAlk_TMT_6Plex_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_PartTryp_DynMetOx_Stat_CysAlk_TMT_6Plex_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_PartTryp_DynMetOx_Stat_CysAlk_iTRAQ_8Plex_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_PartTryp_DynMetOx_Stat_TMT_6Plex_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_PartTryp_Dyn_MetOx_CustomAA_O_Hydroxyproline_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_PartTryp_Dyn_MetOx_NTermAcet_NQR_Deamide_Stat_CysAlk_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_PartTryp_MetOx_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_PartTryp_MetOx_StatCysAlk_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_PartTryp_StatCysAlk_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_PartTryp_Stat_CysAlk_TMT_6Plex_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_Tryp_DynMetOx_Stat_CysAlk_TMT_6Plex_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_Tryp_DynSTYPhos_Stat_CysAlk_TMT_6Plex_Protocol1_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_Tryp_Dyn_MetOx_STYPhos_Stat_CysAlk_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_Tryp_MetOx_15ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_Tryp_MetOx_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_Tryp_MetOx_StatCysAlk_20ppmParTol.txt
  • docs/parameterfiles/MSGFPlus_Tryp_NoMods_20ppmParTol.txt
  • docs/readme.md
  • docs/superpowers/plans/2026-05-23-iter39-docs-rewrite.md
  • docs/superpowers/specs/2026-05-23-iter39-docs-rewrite-design.md
  • docs/training-scoring-models.md
  • docs/troubleshooting.md

Walkthrough

This PR refactors the msgf-rust CLI from numeric parameters to strongly-typed enums while maintaining backward compatibility, validates the refactoring with a regression test, replaces root documentation, and removes legacy Java-era docs.

Changes

CLI refactor and documentation

| Layer | File(s) | Summary |
|---|---|---|
| Typed CLI parameters with dual parsing | crates/msgf-rust/src/bin/msgf-rust.rs (lines 26–68, 959–1038) | Four new clap::ValueEnum enums (Fragmentation, Instrument, Protocol, EnzymeSpecificity) with canonical Rust-style names and custom parsers accepting legacy numeric IDs, each producing range-specific error messages for out-of-range values. |
| CLI struct refactoring to use typed enums | crates/msgf-rust/src/bin/msgf-rust.rs (lines 120–127, 160–161, 170–188) | CLI fields replaced with typed enums; --ntt becomes --enzyme-specificity with hidden --ntt alias; --mod renamed to --mods with hidden --mod alias; parsers wired via value_parser directives. |
| Core resolver refactoring for typed parameters | crates/msgf-rust/src/bin/msgf-rust.rs (lines 697–740, 902–919) | resolve_bundled_param refactored to accept and normalize typed Fragmentation/Instrument/Protocol; implements HCD+LowRes→QExactive upgrade and protocol suffix mapping; resolve_bundled_param_for_activation translates ActivationMethod into typed enum routing. |
| Runtime integration and auto-detection updates | crates/msgf-rust/src/bin/msgf-rust.rs (lines 345–346, 399–400, 447–451) | Mods parsing switches to cli.mods; auto-route eligibility tightened to check Fragmentation::Auto and Instrument::LowRes; num_tolerable_termini derived from typed enzyme_specificity enum. |
| Resolver test suite updates | crates/msgf-rust/src/bin/msgf-rust.rs (lines 1049–1053, 1064–1068, 1083–1087, 1100–1104, 1117–1121, 1130–1144, 1152–1217) | All resolver test cases updated to use typed enum arguments; out-of-range error validation shifted from resolver calls to dedicated unit tests for the new parsing functions. |
| CLI backward compatibility regression test | crates/msgf-rust/tests/cli_smoke.rs (lines 262–332) | New integration test running msgf-rust twice using legacy numeric and canonical named CLI syntax on the same fixture; asserts identical PIN headers and sorted output rows to guarantee byte-identical search behavior. |
| Root documentation | README.md, DOCS.md, CLI_MIGRATION.md | README.md replaced with Rust-focused installation/usage; DOCS.md created with CLI reference (including legacy numeric aliases), mods.txt grammar, output schemas, auto-detection, build/test guidance, training notes, isobaric labeling, and migration notes; CLI_MIGRATION.md created with Java MS-GF+ flag mapping, numeric-legacy-to-named table, and three worked rewrite examples. |
| Legacy documentation cleanup | docs/buildsa.md, docs/examples/*, docs/parameterfiles/*, docs/*.md | Removed legacy Java MS-GF+ documentation: BuildSA reference, example configs, 15+ parameter files, modification/enzyme/protocol templates; preserved docs/superpowers/specs/ and docs/superpowers/plans/ planning artifacts. |
| Implementation plan and design specifications | docs/superpowers/plans/2026-05-23-iter39-docs-rewrite.md, docs/superpowers/specs/2026-05-23-iter39-docs-rewrite-design.md | Complete plan and design spec documenting CLI rename/enum implementation, documentation rewrite, resolver/test updates, new regression test, commit/PR structure, risks, and self-review checklist. |
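
The walkthrough notes that num_tolerable_termini is now derived from the typed enzyme_specificity value. A minimal sketch of such a conversion follows, assuming the legacy `--ntt` numbering (0..=2) lines up with the canonical order non-specific, semi, fully; the method name is hypothetical and the real code may structure this differently.

```rust
/// Enzyme cleavage specificity, mirroring `--enzyme-specificity`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EnzymeSpecificity {
    NonSpecific,
    Semi,
    Fully,
}

impl EnzymeSpecificity {
    /// Number of tolerable termini, i.e. the value formerly passed as `--ntt`.
    /// ASSUMPTION: 0 = non-specific, 1 = semi, 2 = fully.
    pub fn num_tolerable_termini(self) -> u8 {
        match self {
            EnzymeSpecificity::NonSpecific => 0,
            EnzymeSpecificity::Semi => 1,
            EnzymeSpecificity::Fully => 2,
        }
    }
}
```

Keeping the numeric value behind a method means the rest of the search code can stay unchanged while the CLI surface exposes only the named variants.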

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


🐰 A CLI refactor hops into town,
With types so strong and names so clear,
No more numeric down-and-down,
Backward compat keeps us here. 🦁✨
