msgf-rust — peptide identification from MS/MS spectra

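A Rust port of MS-GF+ that takes mzML/MGF spectra and a FASTA database in and produces a Percolator-ready .pin out. At 1% FDR it matches or beats Java MS-GF+ PSM counts on three benchmark datasets (from 0.26% fewer to 9.3% more PSMs), with wall time ranging from about 2% slower to 3.3× faster.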

What is this?

msgf-rust is a from-scratch Rust reimplementation of MS-GF+ (Kim & Pevzner, 2014), the canonical generating-function peptide-identification engine. It reads MS/MS spectra (mzML or MGF), searches them against a FASTA protein database, and emits Percolator-ready PIN rows (or a TSV) with per-PSM features for rescoring. The original Java implementation is preserved on the java-legacy branch.

Why msgf-rust?

Three datasets, three results (all at 1% FDR via Percolator 3.7.1):

| Dataset | Java MS-GF+ PSMs | msgf-rust PSMs | Δ PSMs | Java wall | msgf-rust wall | Wall Δ |
|---|---|---|---|---|---|---|
| Astral DDA (LFQ_Astral_DDA_15min_50ng) | 35,818 | 36,170 | +352 (+0.98%) | 5:49 | 5:57 | ~2% slower |
| PXD001819 (UPS1 yeast tryp) | 14,798 | 14,760 | -38 (-0.26%) | ~150s | 45.88s | 3.3× faster |
| TMT (a05058 PXD007683) | 10,166 | 11,108 | +942 (+9.3%) | ~2:55 | 2:30 | 14% faster |

What that means: on Astral we find about 1% more PSMs than Java at essentially the same wall time; on PXD001819 we come within 0.3% of Java's PSM count at 3.3× the speed; on TMT we find ~9% more PSMs in 14% less wall time. The remaining feature-level divergences (lnEValue, MeanRelErrorTop7 normalization) are tracked in DOCS.md §8d as research follow-up; they don't gate cutover.
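
The percentages quoted for the benchmarks can be re-derived from the raw counts and wall times in the table. The snippet below is a small arithmetic check, not part of msgf-rust; the helper names (`to_seconds`, `pct_change`) are made up for illustration, and the approximate Java timings (~150s, ~2:55) are treated as exact.

```python
def to_seconds(wall: str) -> float:
    """Convert 'm:ss' or 'N.NNs' wall-time strings from the table to seconds."""
    wall = wall.lstrip("~")  # drop the "approximately" marker
    if wall.endswith("s"):
        return float(wall[:-1])
    minutes, seconds = wall.split(":")
    return int(minutes) * 60 + float(seconds)


def pct_change(java: float, rust: float) -> float:
    """Relative change of the msgf-rust value versus the Java value, in percent."""
    return (rust - java) / java * 100.0


# (Java PSMs, msgf-rust PSMs, Java wall, msgf-rust wall), copied from the table.
BENCHMARKS = {
    "Astral DDA": (35818, 36170, "5:49", "5:57"),
    "PXD001819": (14798, 14760, "~150s", "45.88s"),
    "TMT": (10166, 11108, "~2:55", "2:30"),
}

for name, (java_psms, rust_psms, java_wall, rust_wall) in BENCHMARKS.items():
    java_s, rust_s = to_seconds(java_wall), to_seconds(rust_wall)
    print(
        f"{name}: PSMs {pct_change(java_psms, rust_psms):+.2f}%, "
        f"wall {pct_change(java_s, rust_s):+.1f}%, "
        f"speedup {java_s / rust_s:.2f}x"
    )
```

Note that "N% less wall time" and "N× faster" are different measures: a reduction from 175 s to 150 s is about 14% less wall time but only a 1.17× speedup.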

Install

Option 1 — download a release archive (recommended):

Grab the archive for your platform from the Releases page. Five platform builds are published per release:

msgf-rust-<version>-x86_64-unknown-linux-gnu.tar.gz
msgf-rust-<version>-aarch64-unknown-linux-gnu.tar.gz
msgf-rust-<version>-x86_64-apple-darwin.tar.gz
msgf-rust-<version>-aarch64-apple-darwin.tar.gz
msgf-rust-<version>-x86_64-pc-windows-msvc.zip

Each archive contains the msgf-rust binary, the resources/ tree (39 bundled .param files + unimod.obo), and LICENSE/NOTICE/README.
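
If you script the download, the archive name can be derived from the host platform. The following is a minimal sketch, assuming CPython's usual `platform.system()` / `platform.machine()` strings (for example `arm64` on Apple Silicon and `AMD64` on 64-bit Windows); `archive_name` is a hypothetical helper, and the exact version string format used by the releases is not specified here, so the caller has to supply it.

```python
import platform

# Target triples and archive extensions as listed for each release.
TARGETS = {
    ("Linux", "x86_64"): ("x86_64-unknown-linux-gnu", "tar.gz"),
    ("Linux", "aarch64"): ("aarch64-unknown-linux-gnu", "tar.gz"),
    ("Darwin", "x86_64"): ("x86_64-apple-darwin", "tar.gz"),
    ("Darwin", "arm64"): ("aarch64-apple-darwin", "tar.gz"),
    ("Windows", "AMD64"): ("x86_64-pc-windows-msvc", "zip"),
}


def archive_name(version, system=None, machine=None):
    """Build the release archive file name for a platform (defaults to this host)."""
    system = system or platform.system()
    machine = machine or platform.machine()
    try:
        triple, ext = TARGETS[(system, machine)]
    except KeyError:
        raise ValueError(f"no prebuilt archive for {system}/{machine}") from None
    return f"msgf-rust-{version}-{triple}.{ext}"


# "X.Y.Z" is a placeholder, not a real release version.
print(archive_name("X.Y.Z", "Linux", "x86_64"))
```

Platforms outside the five published builds raise `ValueError`; in that case Option 2 or Option 3 below applies.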

Option 2 — cargo install:

cargo install --git https://github.com/bigbio/msgf-rust --bin msgf-rust

Option 3 — build from source:

git clone https://github.com/bigbio/msgf-rust
cd msgf-rust
cargo build --release
# Binary: target/release/msgf-rust

Requires Rust 1.85+ (see rust-toolchain.toml).

Quick Start

msgf-rust \
  --spectrum BSA.mgf \
  --database BSA.fasta \
  --output-pin out.pin

This runs a tryptic search at 20 ppm precursor tolerance with the bundled HCD_QExactive_Tryp scoring model, writes Percolator-format PSMs to out.pin, and prints per-phase timings to stderr. Feed out.pin directly into Percolator (Docker or native) to compute q-values.
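
To get a feel for what a 20 ppm precursor tolerance means in absolute terms, the window can be computed directly. This is a generic illustration of the ppm definition, assuming a symmetric window around the precursor mass; how msgf-rust applies the tolerance internally (for example in combination with the isotope error range) is not shown here, and `ppm_window` is a hypothetical helper.

```python
def ppm_window(mass_da: float, tol_ppm: float = 20.0) -> tuple[float, float]:
    """Return the (low, high) mass bounds for a symmetric ppm tolerance."""
    delta = mass_da * tol_ppm / 1_000_000  # ppm = parts per million of the mass
    return mass_da - delta, mass_da + delta


for mass in (800.0, 1500.0, 3000.0):
    low, high = ppm_window(mass)
    print(f"{mass:.1f} Da at 20 ppm: {low:.4f} to {high:.4f} Da")
```

Because the tolerance is relative, the absolute window grows with precursor mass: ±0.03 Da at 1500 Da and ±0.06 Da at 3000 Da.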

A row in out.pin is one peptide–spectrum match with 28 columns: SpecId, Label, ScanNr, charge one-hot encoding, then features like RawScore, lnSpecEValue, DeNovoScore, ion-current ratios, peptide-length stats, etc. Full column reference: DOCS.md §3a.
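
For a quick sanity check of a PIN file before running Percolator, the rows can be read with Python's standard library. The sketch below assumes the usual Percolator PIN conventions (tab-delimited, one header row, `Label` of 1 for targets and -1 for decoys, and a final protein column that may spill over several tab-separated fields). The toy input uses only six columns, not the 28 that msgf-rust writes, and `read_pin` / `count_labels` are hypothetical helpers.

```python
import csv
import io


def read_pin(handle):
    """Read a PIN file into (header, rows); each row is a dict keyed by column name."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    n = len(header)
    rows = []
    for fields in reader:
        if not fields or fields[0] == "DefaultDirection":
            continue  # skip blank lines and the optional default-direction row
        row = dict(zip(header, fields[: n - 1]))
        row[header[-1]] = fields[n - 1:]  # fold all protein IDs into one list
        rows.append(row)
    return header, rows


def count_labels(rows):
    """Return (targets, decoys) based on the Label column."""
    targets = sum(1 for r in rows if r["Label"] == "1")
    decoys = sum(1 for r in rows if r["Label"] == "-1")
    return targets, decoys


# Toy input with made-up values, for illustration only.
SAMPLE = (
    "SpecId\tLabel\tScanNr\tRawScore\tPeptide\tProteins\n"
    "s1\t1\t100\t42\tK.PEPTIDEK.A\tP1\tP2\n"
    "s2\t-1\t101\t7\tR.EDITPEPK.L\tXXX_P3\n"
)
header, rows = read_pin(io.StringIO(SAMPLE))
print(len(header), count_labels(rows))
```

For a real file, pass `open("out.pin", newline="")` instead of the `io.StringIO` wrapper.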

Common workflows

Tryptic DDA + Percolator (default):

msgf-rust --spectrum spectra.mzML --database db.fasta --output-pin out.pin
docker run --rm -v "$(pwd)":/data biocontainers/percolator:v3.7.1_cv1 \
  percolator -w /data/weights.txt /data/out.pin
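
Percolator writes its tab-delimited PSM results to stdout unless told otherwise, so counting PSMs at 1% FDR is a short post-processing step once that output is saved to a file. The sketch below assumes Percolator's standard PSM output header (`PSMId`, `score`, `q-value`, `posterior_error_prob`, `peptide`, `proteinIds`); the sample rows are made up and `count_psms_at_fdr` is a hypothetical helper.

```python
import csv
import io


def count_psms_at_fdr(handle, threshold=0.01):
    """Count rows of a Percolator PSM table whose q-value is at or below threshold."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    q_col = header.index("q-value")
    return sum(1 for row in reader if row and float(row[q_col]) <= threshold)


# Toy Percolator output with made-up values, for illustration only.
SAMPLE = (
    "PSMId\tscore\tq-value\tposterior_error_prob\tpeptide\tproteinIds\n"
    "s1\t2.1\t0.001\t0.0005\tK.PEPTIDEK.A\tP1\n"
    "s2\t0.4\t0.02\t0.1\tR.EDITPEPK.L\tP2\tP3\n"
    "s3\t1.5\t0.01\t0.01\tK.SAMPLER.G\tP4\n"
)
print(count_psms_at_fdr(io.StringIO(SAMPLE)))
```

The q-value column is looked up by name rather than position, so trailing protein IDs that spill into extra fields do not affect the count.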

TMT 10-plex search with mods.txt:

msgf-rust \
  --spectrum tmt_spectra.mzML \
  --database hsapiens.fasta \
  --output-pin out.pin \
  --mods tmt_10plex_mods.txt \
  --protocol TMT \
  --fragmentation HCD \
  --instrument QExactive

Direct TSV output (skip Percolator; --output-pin is still required):

msgf-rust --spectrum spectra.mzML --database db.fasta \
  --output-pin out.pin --output-tsv out.tsv

quantms pipeline integration:

Point quantms's PSM search step at msgf-rust and use the standard quantms post-processing. The .pin row format is the same; existing quantms scripts using legacy numeric flag values (--fragmentation 3 --instrument 3 --protocol 4) keep working without modification (see CLI_MIGRATION.md).

CLI summary

Most-used flags (full reference in DOCS.md §1):

| Flag | Purpose | Default |
|---|---|---|
| `--spectrum <FILE>` | Input mzML or MGF | (required) |
| `--database <FILE>` | Input FASTA | (required) |
| `--output-pin <FILE>` | Percolator PIN output | (required) |
| `--output-tsv <FILE>` | Optional TSV output | (off) |
| `--mods <FILE>` | mods.txt file (Cam-C + Ox-M built-in) | (off) |
| `--precursor-tol-ppm <FLOAT>` | Precursor mass tolerance | 20.0 |
| `--isotope-error-min/-max <INT>` | Isotope error range | -1, 2 |
| `--charge-min/-max <INT>` | Charge range when not in spectrum | 2, 3 |
| `--enzyme-specificity <auto\|...>` | NTT enforcement | fully |
| `--max-missed-cleavages <INT>` | Missed cleavages | 1 |
| `--min/-max-length <INT>` | Peptide length range | 6, 40 |
| `--min-peaks <INT>` | Min peaks per spectrum to score | 10 |
| `--top-n <INT>` | PSMs retained per spectrum | 10 |
| `--fragmentation <auto\|...>` | Frag method (auto-detect from mzML if auto) | auto |
| `--instrument <low-res\|...>` | Instrument class | low-res |
| `--protocol <auto\|...>` | Search protocol | auto |
| `--param-file <FILE>` | Override bundled scoring model | (auto-pick) |
| `--threads <INT>` | Worker threads | (logical CPUs) |

Run msgf-rust --help for the auto-generated help with full descriptions.
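
When driving msgf-rust from a script, it can help to assemble the argument list in one place. The following is a minimal sketch, assuming the flag spellings listed above; `build_command` is a hypothetical helper, it does no validation, and it only maps keyword names such as `precursor_tol_ppm` onto `--precursor-tol-ppm`.

```python
import shlex


def build_command(spectrum, database, output_pin, **options):
    """Assemble an msgf-rust argument list from the three required inputs plus options."""
    argv = [
        "msgf-rust",
        "--spectrum", str(spectrum),
        "--database", str(database),
        "--output-pin", str(output_pin),
    ]
    for name, value in options.items():
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


argv = build_command(
    "spectra.mzML", "db.fasta", "out.pin",
    precursor_tol_ppm=10.0, threads=8,
)
print(shlex.join(argv))
# To actually run the search: subprocess.run(argv, check=True)
```

Passing the list to `subprocess.run` without `shell=True` avoids quoting problems with file names that contain spaces.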

Auto-detection

For mzML inputs, msgf-rust reads the activation block of the first MS2 spectrum and selects a bundled .param file accordingly. The detection covers HCD/CID/ETD/UVPD activation and LowRes/HighRes/TOF/QExactive instrument classes (via mzML CV params). The bundled model is then resolved from (fragmentation, instrument, protocol). MGF files have no activation metadata, so they go through the CLI defaults (which can be overridden with explicit --fragmentation / --instrument flags). Full resolution table: DOCS.md §4.
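
To see what metadata auto-detection has to work with, you can inspect the activation block of the first MS2 spectrum yourself. The sketch below is not msgf-rust's detection code; it is an independent illustration using Python's standard XML parser, assuming the standard mzML namespace and the PSI-MS accession `MS:1000511` for "ms level". The embedded document is a heavily trimmed toy, not a schema-valid mzML file, and `first_ms2_activation` is a hypothetical helper.

```python
import io
import xml.etree.ElementTree as ET

MZML_NS = "{http://psi.hupo.org/ms/mzml}"
MS_LEVEL = "MS:1000511"  # PSI-MS accession for "ms level"


def first_ms2_activation(source):
    """Return cvParam names from the activation block of the first MS2 spectrum.

    Returns None if the file contains no MS2 spectrum.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != MZML_NS + "spectrum":
            continue
        level = None
        for cv in elem.findall(MZML_NS + "cvParam"):
            if cv.get("accession") == MS_LEVEL:
                level = cv.get("value")
        if level == "2":
            activation = elem.find(f".//{MZML_NS}activation")
            if activation is None:
                return []
            return [cv.get("name") for cv in activation.findall(MZML_NS + "cvParam")]
        elem.clear()  # free memory held by spectra we skip
    return None


# Trimmed toy document: one MS1 spectrum followed by one MS2 spectrum.
SAMPLE_MZML = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml"><run><spectrumList>
<spectrum id="s1"><cvParam accession="MS:1000511" name="ms level" value="1"/></spectrum>
<spectrum id="s2"><cvParam accession="MS:1000511" name="ms level" value="2"/>
<precursorList><precursor><activation>
<cvParam accession="MS:1000422" name="beam-type collision-induced dissociation"/>
</activation></precursor></precursorList></spectrum>
</spectrumList></run></mzML>"""

print(first_ms2_activation(io.BytesIO(SAMPLE_MZML)))
```

`first_ms2_activation` also accepts a file path. If the reported activation terms are not what you expect, set --fragmentation and --instrument explicitly instead of relying on auto-detection.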

Parity vs Java MS-GF+

PIN output columns are bit-exact with Java MS-GF+ on the agreement bucket (same scan + same top-1 peptide) for most features. Three residual divergences are deferred as research follow-up: lnEValue (num_distinct semantics), MeanRelErrorTop7 (error-stat normalization), and the BSA charge-3 SEV gap caused by a difference in the deconvolution implementation (known-divergences.md item #3, kept on the development branch). None gate cutover; 1% FDR PSM counts exceed Java's on the Astral and TMT datasets and are within 0.3% on PXD001819. Full detail: DOCS.md §8d.

Citation

If you use msgf-rust in published work, please cite the original MS-GF+ paper:

Kim, S. and Pevzner, P.A. (2014). MS-GF+ makes progress towards a universal database search tool for proteomics. Nature Communications, 5:5277.

And optionally this Rust port:

bigbio (2026). msgf-rust: a Rust port of MS-GF+ for the quantms pipeline. https://github.com/bigbio/msgf-rust

License

msgf-rust inherits the upstream MS-GF+ UCSD-Noncommercial license. The license restricts redistribution and commercial use; see LICENSE for the full text and NOTICE for attribution. The original Java implementation is preserved on the java-legacy branch (frozen at the bigbio-optimized version) and on the java-legacy-original branch (synced to upstream MSGFPlus/msgfplus/master).

Acknowledgments

  • Sangtae Kim and Pavel Pevzner at UCSD's Center for Computational Mass Spectrometry, and the PNNL proteomics team, for the original MS-GF+ engine and the bundled .param scoring models.
  • The bigbio maintainers and the quantms team.
