From 2609b0ad8e9d94d5f4161983f47506e94a8e2c93 Mon Sep 17 00:00:00 2001 From: percolator <95488+percolator@users.noreply.github.com> Date: Thu, 23 Apr 2026 17:30:41 +0000 Subject: [PATCH] Update interface docs from percolator source --- _posts/2010-10-27-interface.textile | 146 +++++++++++++++++++++------- 1 file changed, 110 insertions(+), 36 deletions(-) diff --git a/_posts/2010-10-27-interface.textile b/_posts/2010-10-27-interface.textile index 1ef7396..709092e 100644 --- a/_posts/2010-10-27-interface.textile +++ b/_posts/2010-10-27-interface.textile @@ -2,53 +2,127 @@ layout: default title: Interface categories: all -terminal_pres: true --- +h2. Input format (PIN) +Percolator accepts a tab-delimited input file (PIN format) where each row +describes one peptide-spectrum match (PSM): -h2. PIN and POUT file formats - -Since version 1.15, Percolator has its own input format called PIN (for Percolator-IN). PIN files are XML files whose structure is defined by the schema "percolator_in.xml":http://github.com/percolator/percolator/raw/xml-1-1/src/xml/percolator_in.xsd, (currently at its 1.1 version). - -Similarly, Percolator's output (called PUT for Percolator-OUT) is defined by the schema "percolator_out.xml":http://github.com/percolator/percolator/raw/xml-1-1/src/xml/percolator_out.xsd, (also at its 1.1 version). - -Supposing pin.xml is a valid XML file in PIN format, Percolator can be run using the following options: -
$ percolator [options] -E pin.xml -X output.xml
+
id \t label \t scannr \t feature1 \t ... \t featureN \t peptide \t proteinId1 \t ... \t proteinIdM
 
-After a successful termination, output.xml will contain Percolator's output formatted in POUT format; the file will be overwritten, or created if not already existing. +* @label@ is @1@ for target PSMs and @-1@ for decoys. +* Features are numerical values used by the SVM. +* The @peptide@ column uses the format @X.PEPTIDE.X@ where X is the flanking amino acid. -h2. SQT file format +PIN files are typically generated by converters such as +"crux":https://crux.ms, "ms2rescore":https://ms2rescore.readthedocs.io, or +"oktoberfest":https://oktoberfest.readthedocs.io. -Percolator 1.14 and below accepted "SQT files":http://www.ncbi.nlm.nih.gov/pubmed/15317041 in input; SQT files are the default output of "Crux":http://noble.gs.washington.edu/proj/crux software toolkit. Although SQT files are not supported directly any more, Percolator is now shipped with a converter program called Sqt2pin available from the "downloads":http://github.com/percolator/percolator/downloads section. Supposing target.sqt and decoy.sqt are two valid SQT files, they can be (merged and) converted into a PIN file by invoking sqt2pin with the following options: -
$ sqt2pin [options] -o output.xml target.sqt decoy.sqt
-
-After a successful termination, output.xml will contain Sqt2pin's output formatted in PIN format; the file will be overwritten, or created if not already existing. +h2. percolator -Percolator does not support SRF files, which are "Bioworks":http://www.thermo.com/bioworks/ Sequest output files, but those can be converted to SQT files using John Prince's Srf2sqt; it is available at "github":http://github.com/jtprince/ms-sequest. Alternativelyit can be obtained by running "gem":http://rubygems.org/ -
$ gem install ms-sequest
+
percolator [options] pin.tsv
 
- -h2. Tab-delimited file format - -Percolator accepts input in a simple Tab-delimited format where each row contains a per PSM: -
id  label  feature1  ...  featureN  peptide  proteinId1  ..  proteinIdM 
-
-label is a flag set to 1 for target PSMs, and -1 for decoys. If Percolator is run with --doc option the first and second feature (third and fourth column) should contain the retention time and difference between observed and calculated mass/charge. Example files may be downloaded "here":http://noble.gs.washington.edu/proj/percolator/data/yeast-01.tab.gz - -To run Percolator on a Tab-delimited input, use the following optins: -
$ percolator -j tab-delimited -X output.xml
+
+  
+    
+  
+  
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+    
+  
+
OptionDescription
-Q <filename>, --pepxml-output <filename>Write a rudimentary pepXML file with PSM-level statistics to the specified file.
--stdinput-tabRead tab-delimited PIN format from standard input.
-p <value>, --Cpos <value>SVM cost for positive examples. Determined by cross-validation if not specified.
-n <value>, --Cneg <value>SVM cost for negative examples. Determined by cross-validation if not specified, or if -p is not specified.
-t <value>, --testFDR <value>FDR threshold for selecting the best cross-validation result and for the final reported results. Default = 0.01.
-F <value>, --trainFDR <value>FDR threshold for defining positive training examples. Defaults to --testFDR when set to 0. Default = 0.01.
-i <number>, --maxiter <number>Maximum number of iterations. Default = 10.
-N <number>, --subset-max-train <number>Only train an SVM on a subset of <x> PSMs, and use the resulting score vector to evaluate the other PSMs. Recommended when analyzing huge numbers (>1 million) of PSMs. When set to 0, all PSMs are used for training as normal. Default = 0.
-x, --quick-validationQuicker execution by reduced internal cross-validation.
-J <filename>, --tab-out <filename>Output computed features to the specified file in tab-delimited PIN format.
-j <filename>, --tab-in <filename>Input file in tab-delimited PIN format. Tab-delimited input is the default; this flag exists for backwards compatibility.
-K <char>, --protein-name-separator <char>The separator used in tab delimited input and output for separating protein names. Default = tab-char.
-w <filename>, --weights <filename>Output final SVM weights to the given file.
-W <filename>, --init-weights <filename>Read the unnormalized initial weights from the third line of the given file. This can be the output of the --weights option from a previous Percolator analysis. Note that the weights must be in the same order as the features in the PSM input file(s).
--staticUse the provided initial weights as a static model. If used, the --init-weights option must be specified.
-V <[-]?featureName>, --default-direction <[-]?featureName>Use given feature name as initial search direction, can be negated to indicate that a lower value is better.
-v <level>, --verbose <level>Set verbosity of output: 0=no processing info, 5=all. Default = 2.
-o, --no-terminateDo not stop execution when encountering questionable SVM inputs or results.
-u, --unitnormUse unit normalization [0-1] instead of standard deviation normalization.
-R, --test-each-iterationMeasure performance on the test set after each iteration.
-O, --overrideOverride error check and do not fall back on default score vector in case of suspect score vector from SVM.
-S <value>, --seed <value>Set seed of the random number generator. Default = 1.
-r <filename>, --results-peptides <filename>Output tab-delimited results of peptides to a file instead of stdout (ignored if used with -U).
-B <filename>, --decoy-results-peptides <filename>Output tab-delimited results for decoy peptides to a file (ignored if used with -U).
-m <filename>, --results-psms <filename>Output tab-delimited results of PSMs to a file instead of stdout.
-M <filename>, --decoy-results-psms <filename>Output tab-delimited results for decoy PSMs to a file.
-U, --only-psmsReport only PSM-level results, skipping peptide-level aggregation.
-y, --post-processing-mix-maxUse the mix-max method to assign q-values and PEPs. Note that this option only has an effect if the input PSMs are from separate target and decoy searches. This is the default setting.
-Y, --post-processing-tdcReplace the mix-max method by target-decoy competition for assigning q-values and PEPs. If the input PSMs are from separate target and decoy searches, Percolator's SVM scores will be used to eliminate the lower scoring target or decoy PSM(s) of each scan+expMass combination. If the input PSMs are detected to be coming from a concatenated search, this option will be turned on automatically, as this is incompatible with the mix-max method. In case this detection fails, turn this option on explicitly.
-I <value>, --search-input <value>Specify the type of target-decoy search: \"auto\" (Percolator attempts to detect the search type automatically), \"concatenated\" (single search on concatenated target-decoy protein db) or \"separate\" (two searches, one against target and one against decoy protein db). Default = \"auto\".
-f <value>, --picked-protein <value>Use the picked protein-level FDR to infer protein probabilities. Provide the fasta file as the argument to this flag, which will be used for protein grouping based on an in-silico digest. If no fasta file is available or protein grouping is not desired, set this flag to \"auto\" to skip protein grouping.
-l <filename>, --results-proteins <filename>Output tab-delimited results of proteins to a file instead of stdout (only valid with -f).
-L <filename>, --decoy-results-proteins <filename>Output tab-delimited results for decoy proteins to a file (only valid with -f).
-P <value>, --protein-decoy-pattern <value>Define the prefix to identify decoy proteins in the database for the picked-protein algorithm. This will have no effect on the target/decoy labels specified in the input file. One of \"auto\" (Percolator guesses the prefix from the input file) or \"'<DECOY NAME>'\" (search for decoys using the '<DECOY NAME>' pattern). Default = \"auto\".
-z, --protein-enzymeType of enzyme used for in silico protein digestion for picked protein-level FDR estimation. One of \"no_enzyme\",\"elastase\",\"pepsin\",\"proteinasek\",\"thermolysin\",\"trypsinp\",\"chymotrypsin\",\"lys-n\",\"lys-c\",\"arg-c\",\"asp-n\",\"glu-c\",\"trypsin\". Default=\"trypsin\".
-c, --protein-report-fragmentsBy default, if the peptides associated with protein A are a proper subset of the peptides associated with protein B, then protein A is eliminated and all the peptides are considered as evidence for protein B. Note that this filtering is done based on the complete set of peptides in the database, not based on the identified peptides in the search results. Alternatively, if this option is set and if all of the identified peptides associated with protein B are also associated with protein A, then Percolator will report a comma-separated list of protein IDs, where the full-length protein B is first in the list and the fragment protein A is listed second. Commas inside protein IDs will be replaced by semicolons. Not available for Fido.
-g, --protein-report-duplicatesIf this option is set and multiple database proteins contain exactly the same set of peptides, then the IDs of these duplicated proteins will be reported as a comma-separated list, instead of the default behavior of randomly discarding all but one of the proteins. Commas inside protein IDs will be replaced by semicolons. Not available for Fido.
--no-analyticsSwitch off analytics reporting.
[experimental]
--num-threads <value>
Number of total parallel threads for SVM training during cross validation. Default (one thread per CV fold) = 3.
[experimental]
--nested-xval-bins <value>
Number of nested cross validation bins within each cross validation bin. This should reduce overfitting of the hyperparameters. Default = 1.
[experimental]
--spectral-counting-fdr <value>
Activate spectral counting at the protein level (requires --picked-protein) at the specified PSM q-value threshold. Adds two columns, \"spec_count_unique\" and \"spec_count_all\", to the protein tab separated output, containing the spectral count for the peptides unique to the protein and the spectral count including shared peptides respectively.
[experimental]
--train-best-positive
Enforce that, for each spectrum, at most one PSM is included in the positive set during each training iteration. If the user only provides one PSM per spectrum, this filter will have no effect.
[experimental]
--train-fdr-initial <value>
Set the FDR threshold for the first iteration. This is useful in cases where the original features do not display a good separation between targets and decoys. In subsequent iterations, the normal --trainFDR will be used.
[experimental]
--parameter-file <filename>
Read flags from a parameter file. If flags are specified on the command line as well, these will override the ones in the parameter file.
[experimental]
--reset-algorithm
Run an implementation of the Percolator-RESET Algorithm.
[experimental]
--composition-match
Run an implementation of the Percolator-RESET psmsAndPeptides with target-decoy matching based on composition.
-RT, --output-retention-timeAdd a retention time column to the output file.
--irls-pepCalculate PEPs using a cubic spline fitted using penalized log-likelihood fitting as described in PMID:18689838. This used to be the default method.
[experimental]
--ip-pep
Use scores instead of rank as independent variable when calculating PEPs.
[experimental]
--pava-pep
Calculate PEPs using PAVA isotonic regression.
+ +h2. qvality + +Qvality estimates posterior error probabilities (PEP) and q-values from a +set of target and decoy scores. + +
qvality [options] target_scores.txt decoy_scores.txt
 
-where tab-delimited is a valid Tab-delimited file. -Supposing pin.xml is a valid XML file in PIN format, it is possible to use Percolator as a converter and generate Tab-delimited files from PIN files by using the following options: -
$ percolator -E pin.xml -J tab-delimited
+Alternatively, accurate p-values can be provided in a single file.
+
+
+  
+    
+  
+  
+    
+    
+    
+    
+    
+    
+    
+    
+  
+
OptionDescription
-v <level>, --verbose <level>Set verbosity of output: 0=no processing info, 5=all. Default = 2.
-s <value>, --epsilon-step <value>Relative step size threshold before cross-validation error is calculated.
-n <bins>, --number-of-bins <bins>The number of spline knots used when interpolating spline function. Default is 500.
-c <value>, --epsilon-cross-validation <value>Relative cross-validation step size threshold for ending iterations.
-r, --reverseIndicate that lower scores are better (reversed scoring).
-o <file>, --output-file <file>Output results to file instead of stdout
-Y, --tdc-inputTurn off the pi0 correction for results from a concatenated target-decoy database.
-d, --include-negativeInclude decoy probabilities in the output.
+ +h2. picked-protein + +Picked-protein infers protein-level FDR from Percolator peptide output. + +
picked-protein -i percolator_peptides.tsv -d database.fasta -o proteins.tsv
 
-After a successful termination, tab-delimited will contain a valid Tab-delimited file that can be fed to Percolator through the -j described above; the file will be overwritten, or created if not already existing. - -h2. Mascot - -"Mascot":http://matrixscience.com users might be interested in the Mascot file parser for Percolator: "mascot-percolator":http://www.sanger.ac.uk/Software/analysis/MascotPercolator/ written by "Markus Brosch":http://www.sanger.ac.uk/Teams/Team17/team.shtml at Sanger institute. -Percolator is provided as a part of "Mascot Server":http://matrixscience.com version 2.3. + + + + + + + + + + +
OptionDescription
-v <level>, --verbose <level>Set verbosity of output: 0=no processing info, 5=all. Default = 2.
-i <filename>, --peptide-in <filename>Tab-delimited peptide results file from Percolator.
-d <filename>, --database <filename>Protein sequence database in FASTA format.
-o <filename>, --protein-out <filename>Output file for inferred protein results.