A Rust port of MS-GF+: takes mzML/MGF spectra + FASTA in, produces Percolator-ready
.pin out. Matches or exceeds Java MS-GF+ PSM counts (within 0.3%) on three benchmark datasets at 1% FDR, with wall time ranging from parity to 3.3× faster.
msgf-rust is a from-scratch Rust reimplementation of MS-GF+ (Kim & Pevzner, 2014), the canonical generating-function peptide-identification engine. It reads MS/MS spectra (mzML or MGF), searches them against a FASTA protein database, and emits Percolator-ready PIN rows (or a TSV) with per-PSM features for rescoring. The original Java implementation is preserved on the java-legacy branch.
Three datasets, three results (all at 1% FDR via Percolator 3.7.1):
| Dataset | Java MS-GF+ PSMs | msgf-rust PSMs | Δ | Java wall | msgf-rust wall | Wall Δ |
|---|---|---|---|---|---|---|
| Astral DDA (LFQ_Astral_DDA_15min_50ng) | 35,818 | 36,170 | +352 (+0.98%) | 5:49 | 5:57 | ~2% slower |
| PXD001819 (UPS1 yeast tryp) | 14,798 | 14,760 | -38 (-0.26%) | ~150s | 45.88s | 3.3× faster |
| TMT (a05058 PXD007683) | 10,166 | 11,108 | +942 (+9.3%) | ~2:55 | 2:30 | 14% faster |
What that means: on Astral we find ~1% more PSMs than Java at essentially the same wall time; on PXD001819 we match Java's PSM count to within 0.3% at 3.3× the speed; on TMT we find ~9% more PSMs in 14% less wall time. The remaining feature-level divergences (lnEValue, MeanRelErrorTop7 normalization) are tracked in DOCS.md §8d as research follow-up; they don't gate cutover.
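The deltas in the table can be recomputed from the raw counts and wall times. The following is a minimal Python sketch, not part of msgf-rust: the helper names are illustrative, wall times are converted to seconds by hand, and the approximate (`~`) Java timings are treated as exact.

```python
def pct_change(baseline: float, new: float) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (dataset, Java PSMs, msgf-rust PSMs, Java wall [s], msgf-rust wall [s])
BENCHMARKS = [
    ("Astral DDA", 35_818, 36_170, 5 * 60 + 49, 5 * 60 + 57),
    ("PXD001819", 14_798, 14_760, 150.0, 45.88),
    ("TMT", 10_166, 11_108, 2 * 60 + 55, 2 * 60 + 30),
]


def summarize(rows=BENCHMARKS):
    """Return per-dataset PSM and wall-time deltas keyed by dataset name."""
    out = {}
    for name, java_psms, rust_psms, java_s, rust_s in rows:
        out[name] = {
            "psm_delta": rust_psms - java_psms,
            "psm_pct": round(pct_change(java_psms, rust_psms), 2),
            # > 1.0 means msgf-rust finished sooner than Java
            "speedup": round(java_s / rust_s, 2),
            # negative means msgf-rust used less wall time
            "wall_pct": round(pct_change(java_s, rust_s), 1),
        }
    return out


if __name__ == "__main__":
    for name, stats in summarize().items():
        print(name, stats)
```

A negative `wall_pct` corresponds to the "faster" entries in the Wall Δ column.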
Option 1 — download a release archive (recommended):
Grab the archive for your platform from the Releases page. Five platform builds are published per release:
msgf-rust-<version>-x86_64-unknown-linux-gnu.tar.gz
msgf-rust-<version>-aarch64-unknown-linux-gnu.tar.gz
msgf-rust-<version>-x86_64-apple-darwin.tar.gz
msgf-rust-<version>-aarch64-apple-darwin.tar.gz
msgf-rust-<version>-x86_64-pc-windows-msvc.zip
Each archive contains the msgf-rust binary, the resources/ tree (39 bundled .param files + unimod.obo), and LICENSE/NOTICE/README.
Option 2 — cargo install:
cargo install --git https://github.com/bigbio/msgf-rust --bin msgf-rust
Option 3 — build from source:
git clone https://github.com/bigbio/msgf-rust
cd msgf-rust
cargo build --release
# Binary: target/release/msgf-rust
Requires Rust 1.85+ (see rust-toolchain.toml).
msgf-rust \
--spectrum BSA.mgf \
--database BSA.fasta \
--output-pin out.pin
This runs a tryptic search at 20 ppm precursor tolerance with the bundled HCD_QExactive_Tryp scoring model, writes Percolator-format PSMs to out.pin, and prints per-phase timings to stderr. Feed out.pin directly into Percolator (Docker or native) to compute q-values.
A row in out.pin is one peptide–spectrum match with 28 columns: SpecId, Label, ScanNr, charge one-hot encoding, then features like RawScore, lnSpecEValue, DeNovoScore, ion-current ratios, peptide-length stats, etc. Full column reference: DOCS.md §3a.
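For quick sanity checks before running Percolator, a PIN file can be read with a few lines of Python. This is a sketch under assumptions: it relies on the usual Percolator tab-delimited convention (a header row, `Label` of `1` for targets and `-1` for decoys, and a final `Proteins` column that may spill into extra tab-separated fields). The embedded sample is made up and has far fewer columns than a real out.pin; `read_pin` and `count_labels` are illustrative names, not part of msgf-rust.

```python
import csv
import io

# Made-up sample for illustration only; a real out.pin has 28 columns
# (see DOCS.md §3a) and different values.
SAMPLE_PIN = (
    "SpecId\tLabel\tScanNr\tRawScore\tlnSpecEValue\tPeptide\tProteins\n"
    "spec_101_1\t1\t101\t142\t-28.4\tK.PEPTIDEK.A\tPROT_A\n"
    "spec_101_2\t-1\t101\t35\t-9.1\tR.EDITPEPK.G\tDECOY_PROT_A\n"
    "spec_205_1\t1\t205\t97\t-19.7\tK.SAMPLER.L\tPROT_A\tPROT_B\n"
)


def read_pin(handle):
    """Parse a tab-delimited PIN stream into a list of dicts.

    The last header column is treated as a list, because Percolator's
    tab format allows several protein IDs as trailing fields.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    n_fixed = len(header) - 1  # every column before the trailing one
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header[:n_fixed], fields[:n_fixed]))
        row[header[-1]] = fields[n_fixed:]
        rows.append(row)
    return rows


def count_labels(rows):
    """Return (targets, decoys) based on the Label column."""
    targets = sum(1 for r in rows if r["Label"] == "1")
    decoys = sum(1 for r in rows if r["Label"] == "-1")
    return targets, decoys


if __name__ == "__main__":
    rows = read_pin(io.StringIO(SAMPLE_PIN))
    print(count_labels(rows))
```

For a real file, pass `open("out.pin", newline="")` instead of the `StringIO` wrapper.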
Tryptic DDA + Percolator (default):
msgf-rust --spectrum spectra.mzML --database db.fasta --output-pin out.pin
docker run --rm -v $(pwd):/data biocontainers/percolator:v3.7.1_cv1 \
percolator -X /data/weights.txt /data/out.pin
TMT 10-plex search with mods.txt:
msgf-rust \
--spectrum tmt_spectra.mzML \
--database hsapiens.fasta \
--output-pin out.pin \
--mods tmt_10plex_mods.txt \
--protocol TMT \
--fragmentation HCD \
--instrument QExactive
Direct TSV output (skip Percolator):
msgf-rust --spectrum spectra.mzML --database db.fasta \
--output-pin out.pin --output-tsv out.tsv
quantms pipeline integration:
Point quantms's PSM search step at msgf-rust and use the standard quantms post-processing. The .pin row format is the same; existing quantms scripts using legacy numeric flag values (--fragmentation 3 --instrument 3 --protocol 4) keep working without modification (see CLI_MIGRATION.md).
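Going by upstream MS-GF+'s numeric IDs (`-m 3` is HCD, `-inst 3` is Q-Exactive, `-protocol 4` is TMT), the legacy invocation quoted above appears to correspond to the named values used in the TMT example. The sketch below only illustrates that equivalence. msgf-rust accepts the legacy values itself, so no such wrapper is needed. The mapping is deliberately limited to those three pairs and is an assumption; CLI_MIGRATION.md is the authoritative source.

```python
# Assumed equivalences, limited to the three values quoted in this README.
# The full legacy mapping is documented in CLI_MIGRATION.md.
LEGACY_TO_NAMED = {
    "--fragmentation": {"3": "HCD"},
    "--instrument": {"3": "QExactive"},
    "--protocol": {"4": "TMT"},
}


def normalize_legacy_flags(argv):
    """Return a copy of argv with known legacy numeric values spelled out.

    Flags or values not present in LEGACY_TO_NAMED pass through unchanged.
    """
    out = list(argv)
    for i in range(len(out) - 1):
        table = LEGACY_TO_NAMED.get(out[i])
        if table is not None:
            out[i + 1] = table.get(out[i + 1], out[i + 1])
    return out


if __name__ == "__main__":
    legacy = ["msgf-rust", "--fragmentation", "3",
              "--instrument", "3", "--protocol", "4"]
    print(" ".join(normalize_legacy_flags(legacy)))
```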
Most-used flags (full reference in DOCS.md §1):
| Flag | Purpose | Default |
|---|---|---|
| `--spectrum <FILE>` | Input mzML or MGF | (required) |
| `--database <FILE>` | Input FASTA | (required) |
| `--output-pin <FILE>` | Percolator PIN output | (required) |
| `--output-tsv <FILE>` | Optional TSV output | (off) |
| `--mods <FILE>` | mods.txt file (Cam-C + Ox-M built-in) | (off) |
| `--precursor-tol-ppm <FLOAT>` | Precursor mass tolerance (ppm) | 20.0 |
| `--isotope-error-min/-max <INT>` | Isotope error range | -1, 2 |
| `--charge-min/-max <INT>` | Charge range when not in spectrum | 2, 3 |
| `--enzyme-specificity <auto\|...>` | NTT enforcement | fully |
| `--max-missed-cleavages <INT>` | Missed cleavages | 1 |
| `--min/-max-length <INT>` | Peptide length range | 6, 40 |
| `--min-peaks <INT>` | Min peaks per spectrum to score | 10 |
| `--top-n <INT>` | PSMs retained per spectrum | 10 |
| `--fragmentation <auto\|...>` | Frag method (auto-detect from mzML if auto) | auto |
| `--instrument <low-res\|...>` | Instrument class | low-res |
| `--protocol <auto\|...>` | Search protocol | auto |
| `--param-file <FILE>` | Override bundled scoring model | (auto-pick) |
| `--threads <INT>` | Worker threads | (logical CPUs) |
Run msgf-rust --help for the auto-generated help with full descriptions.
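To give a feel for what the precursor tolerance and isotope error range mean in mass terms, here is a small Python sketch of the conventional calculation: a ppm tolerance scales with mass, and each isotope error `k` shifts the expected monoisotopic mass by `k` times the 13C/12C spacing. This is the textbook convention and an assumption about how the two flags combine; it is not taken from the msgf-rust source, and the function names are illustrative.

```python
ISOTOPE_SPACING_DA = 1.0033548  # mass difference between 13C and 12C, in Da


def ppm_to_da(mass_da: float, tol_ppm: float) -> float:
    """Convert a ppm tolerance into an absolute tolerance in Da at mass_da."""
    return mass_da * tol_ppm * 1e-6


def candidate_windows(precursor_mass_da, tol_ppm=20.0, iso_min=-1, iso_max=2):
    """List (isotope_error, low, high) peptide-mass windows to search.

    For isotope error k, the observed precursor is assumed to sit k isotope
    peaks above the monoisotopic mass, so the window is centred k spacings
    below the observed mass.
    """
    windows = []
    for k in range(iso_min, iso_max + 1):
        center = precursor_mass_da - k * ISOTOPE_SPACING_DA
        half_width = ppm_to_da(center, tol_ppm)
        windows.append((k, center - half_width, center + half_width))
    return windows


if __name__ == "__main__":
    for k, low, high in candidate_windows(1500.0):
        print(f"isotope error {k:+d}: {low:.4f} .. {high:.4f} Da")
```

With the defaults from the table, a single precursor yields four candidate windows (isotope errors -1, 0, 1, 2), each about 0.06 Da wide at a mass of 1500 Da.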
For mzML inputs, msgf-rust reads the activation block of the first MS2 spectrum and selects a bundled .param file accordingly. The detection covers HCD/CID/ETD/UVPD activation and LowRes/HighRes/TOF/QExactive instrument classes (via mzML CV params). The bundled model is then resolved from (fragmentation, instrument, protocol). MGF files have no activation metadata, so they go through the CLI defaults (which can be overridden with explicit --fragmentation / --instrument flags). Full resolution table: DOCS.md §4.
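The resolution step can be pictured as a lookup keyed on the three values. The Python sketch below is hypothetical: it assumes model names follow the pattern of the `HCD_QExactive_Tryp` model mentioned earlier (fragmentation, instrument, enzyme, with an optional protocol suffix) and that a protocol-specific model falls back to the generic one when absent. The actual table and fallback order are in DOCS.md §4 and may differ.

```python
def resolve_model(fragmentation, instrument, protocol, available, enzyme="Tryp"):
    """Pick a scoring-model name from `available` (a set of model names).

    Hypothetical naming scheme: <frag>_<instrument>_<enzyme>[_<protocol>].
    A protocol-specific model is preferred; otherwise fall back to the
    generic model for the same fragmentation/instrument/enzyme.
    """
    base = f"{fragmentation}_{instrument}_{enzyme}"
    candidates = []
    if protocol and protocol != "auto":
        candidates.append(f"{base}_{protocol}")
    candidates.append(base)
    for name in candidates:
        if name in available:
            return name
    raise LookupError(f"no bundled model among {candidates}")


if __name__ == "__main__":
    # Only HCD_QExactive_Tryp is named in this README; the set is illustrative.
    bundled = {"HCD_QExactive_Tryp"}
    print(resolve_model("HCD", "QExactive", "auto", bundled))
```

Passing an explicit `--param-file` bypasses this kind of lookup entirely.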
PIN output columns are bit-exact with Java MS-GF+ on the agreement bucket (same scan + same top-1 peptide) for most features. Three residual divergences exist as deferred research: lnEValue (num_distinct semantics), MeanRelErrorTop7 (error-stat normalization), and the BSA charge-3 SEV gap from the deconvolution-implementation difference (known-divergences.md item #3, kept on the development branch). None gate cutover; aggregate 1% FDR PSM counts exceed Java's on two of the three benchmark datasets and are within 0.3% on the third. Full detail: DOCS.md §8d.
If you use msgf-rust in published work, please cite the original MS-GF+ paper:
Kim, S. and Pevzner, P.A. (2014). MS-GF+ makes progress towards a universal database search tool for proteomics. Nature Communications, 5:5277.
And optionally this Rust port:
bigbio (2026). msgf-rust: a Rust port of MS-GF+ for the quantms pipeline. https://github.com/bigbio/msgf-rust
msgf-rust inherits the upstream MS-GF+ UCSD-Noncommercial license. The license restricts redistribution and commercial use; see LICENSE for the full text and NOTICE for attribution. The original Java implementation is preserved on the java-legacy branch (frozen at the bigbio-optimized version) and on the java-legacy-original branch (synced to upstream MSGFPlus/msgfplus/master).