griffithlab · chrisamiller · Jun 30, 2026 · Jun 30, 2026
diff --git a/docs/index.rst b/docs/index.rst
@@ -10,31 +10,34 @@ VAtools is a python package that includes several tools to
 annotate VCF files with data from other tools.
 
 **vcf-readcount-annotator**
-    A tool that will add the data from bam-readcount files to the VCF sample
-    column.
+    A tool that will add the data from `bam-readcount <https://github.com/genome/bam-readcount>`_ files to the VCF sample
+    column. Writes depth, allele counts, and VAFs; optionally also writes
+    detailed per-read quality metrics (mapping quality, base quality, strand
+    counts, and more) as additional FORMAT fields.
 
 **vcf-expression-annotator**
-    A tool that will add the data from several expression tools' output files
-    to the VCF INFO column. Supported tools are StringTie, Kallisto,
-    and Cufflinks. There also is a ``custom`` option to annotate with data
+    A tool that will add gene expression data to the VCF INFO column. 
+    Directly supports outputs from `StringTie <https://github.com/gpertea/stringtie>`_, `Kallisto <https://kallisto.readthedocs.io/en/latest/>`_, and `Cufflinks <https://github.com/cole-trapnell-lab/cufflinks>`_. 
+    There is also a ``custom`` option to allow annotation with data
     from any tab-delimited file.
 
 **vcf-info-annotator**
-    A tool that will add data from a tab-delimited file to any user-specified
-    field in the VCF INFO column.
+    A general-purpose tool that will add data from a tab-delimited file into VCF INFO fields.
+    Supports mapping multiple TSV columns to multiple INFO fields in a single
+    pass.
 
 **vcf-genotype-annotator**
     A tool to add a new sample to an existing VCF file or fill in the GT field
-    for an existing sample in a VCF.
+    for an existing sample in a VCF. This is useful when a VCF lacks sample or
+    genotype fields that downstream tools require.
 
 **vep-annotation-reporter**
-    A tool to create a tab-delimited (TSV) file of variants in a VCF and their
-    VEP annotations.
+    A tool to parse the complex `VEP <https://www.ensembl.org/vep>`_-added CSQ field from a VCF and create a tab-delimited (TSV) file of variants and their VEP annotations.
 
 **ref-transcript-mismatch-reporter**
     A tool to identify variants in a VCF where the reference genome used to
     align and call variants doesn't match the Ensembl reference transcript
-    used by VEP for variant consequence annotations.
+    used by VEP for variant consequence annotations. 
 
 **transform-split-values**
     A tool that extracts and manipulates values from existing sample fields

diff --git a/docs/install.rst b/docs/install.rst
@@ -3,49 +3,30 @@ Install
 
 The VAtools suite is written for Linux and Mac OS X.
 If you are using Windows you will need to set up a
-Linux environment, for example by setting up a virtual machine.
+Linux environment, for example by using WSL or setting up a virtual machine.
 
-VAtools requires Python 3.8 or above. Before running any
-installation steps, check the Python version installed on your system:
+VAtools requires Python 3.8 or above. Check the Python version installed on your system with:
 
 .. code-block:: none
 
-   python -V
+   python --version
 
-If you don't have Python 3 installed, we recommend using `Conda
-<http://conda.pydata.org/docs/py2or3.html>`_ to emulate a Python 3.
-environment. We've encountered problems with users that already have Python
-2.x installed when they also try to install Python 3. The defaults will
-not be set correctly in that case. If you already have Python 2.x installed
-we **strongly** recommmend using Conda instead of installing Python 3
-locally.
+pip
+---
 
-Once you have set up your Python 3 environment correctly you can use
-``pip`` to install VAtools. Make sure you have ``pip``
-installed. ``pip`` is generally included in python distributions, but may
-need to be upgraded before use. See the `instructions
-<https://packaging.python.org/en/latest/installing/#install-pip-setuptools-and-wheel>`_
-for installing or upgrading ``pip``.
-
-After you have pip installed, type the following command on your Terminal:
+Install VAtools using ``pip``:
 
 .. code-block:: none
 
    pip install vatools
 
-You can check that the ``vatools`` package has been installed
-under the default environment by running this command:
+You can verify the installation with:
 
 .. code-block:: none
 
    pip show vatools
 
-``pip`` will fetch and install VAtools and its dependencies for you.
-After installing, each tool of the VAtools package is available in
-its own command line tree directly from the Terminal.
-
-If you have an old version of the vatools package installed you might
-want to consider upgrading to the latest version:
+To upgrade an existing installation:
 
 .. code-block:: none
 
@@ -56,3 +37,22 @@ Docker
 
 A Docker container for VAtools is available on DockerHub using the
 `griffithlab/vatools <https://hub.docker.com/r/griffithlab/vatools/>`_ repo.
+
+.. code-block:: none
+
+   docker pull griffithlab/vatools
+
+Run any tool inside the container by passing it as the command. For example:
+
+.. code-block:: none
+
+   docker run griffithlab/vatools vcf-readcount-annotator --help
+
+To annotate a VCF with files on your local filesystem, mount the directory
+containing your data as a volume:
+
+.. code-block:: none
+
+   docker run -v /path/to/data:/data griffithlab/vatools \
+     vcf-readcount-annotator /data/input.vcf /data/readcounts.tsv DNA \
+     -o /data/output.vcf
diff --git a/docs/ref_transcript_mismatch_reporter.rst b/docs/ref_transcript_mismatch_reporter.rst
@@ -3,35 +3,29 @@ Reference Transcript Mismatch Reporter
 
 This tool can be used to identify variants where the reference genome build
 doesn't match the Ensembl reference transcript used by VEP for variant consequence
-annotations. In these cases, the REF nucleotide(s) at a variant position will differ
+annotations. This is problematic because, in these cases, the REF nucleotide(s) at a variant position will differ
 from the Ensembl transcript nucleotide(s) at the corresponding mutation
-position. Any resulting amino acid change predictions found in the ``Amino_acids``
-field of the VEP CSQ annotation will then be different from the translated Ensembl transcript
+position. The end result is that any amino acid change predictions found in the ``Amino_acids``
+field of the VEP CSQ annotation will be different from the translated Ensembl transcript
 amino acids at that position.
 
-This will lead to errors in some downstream tools, e.g. pVACseq, which rely on
-the ``Amino_acids`` field as well as the translated Ensembl transcript peptide
-sequence - as reported by the ``Wildtype`` plugin to make predictions about
-the impact of the mutation on the transcript peptide sequence. Since those two
-fields will be in disagreement in such cases as described above, pVACseq
+This will lead to errors in some downstream tools (e.g. pVACseq), which rely on accurate peptide sequences to make predictions about neoantigen MHC binding. When mismatches occur, the ``Amino_acids`` field and the peptide sequence reported by the ``Wildtype`` plugin
+will be in disagreement and pVACseq
 cannot make predictions on such variants.
 
-Such errors might occur in a small number of variants if there are only minor
-differences between the reference used and the Ensembl transcripts but they
-might also be more widespread, for example, if users aligned to GRCh37 but
-used a GRCh38 VEP cache.
+If there are only minor differences between the reference used and the Ensembl transcripts, only a small number of variants may be affected. If a large number of variants are flagged, then it would be wise to step back and consider whether there is a fundamental lack of compatibility (for example, alignments to GRCh37 that use a GRCh38 VEP cache).
 
 The input VCF needs to be annotated by VEP and requires annotation with the
 ``Wildtype`` VEP plugin available as part of `pVACtools <https://pvactools.readthedocs.io/en/latest/pvacseq/input_file_prep/vep.html#installing-vep>`_.
 
 This tool will report on the number of variants and transcripts in a VCF that
-are affected by this issuei and output this information to stdout. It will
+are affected by this issue and output this information to stdout. It will
 write a ``.mismatch.tsv`` file next to the VCF that provides further details
 on the problematic variants.
 
 This tool also allows the user to either soft-filter or hard-filter the VCF
 using the ``--filter [soft|hard]`` parameter. Soft-filtering will tag the
-problematic variants with a custom VCF FILTER ``CSQ_MISMACH`` while hard-filtering
+problematic variants with a custom VCF FILTER ``CSQ_MISMATCH`` while hard-filtering
 will produce a new VCF that has these variants removed. When using a filter,
 the output VCF will be written to a ``filtered.vcf`` file next to
 your input VCF file. You can set a different output file using the
@@ -47,4 +41,4 @@ Example Command
 
 .. code-block:: none
 
-   ref-transcript-mismatch-report input.vcf --filter soft
+   ref-transcript-mismatch-reporter input.vcf --filter soft
diff --git a/docs/transform_split_values.rst b/docs/transform_split_values.rst
@@ -18,9 +18,9 @@ Supported operations are the following:
 - ``ref_ratio``: The first value in a R-number field divided by the sum of all the numbers (the reference ratio).
 - ``alt_ratio``: The second value in a R-number field divided by the sum of all the numbers (the alt ratio).
 
-If your VCF is a multi-sample VCF, you have to pick one of the sample in
-your VCF by setting the ``--sample-name`` option. This is the sample that the
-readcounts will be written for.
+If your VCF is a multi-sample VCF, you have to pick one of the samples in
+your VCF by setting the ``--sample-name`` option. This is the sample whose
+field values will be extracted.
 
 By default the output TSV will be written to a ``.tsv`` file next to
 your input VCF file. You can set a different output file using the
@@ -30,3 +30,10 @@ Usage
 -----
 
 .. program-output:: transform-split-values -h
+
+Example Command
+---------------
+
+.. code-block:: none
+
+   transform-split-values input.vcf AD alt -o output.tsv
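
The ``ref_ratio`` and ``alt_ratio`` operations described above can be sketched in Python as follows. This is a minimal illustration, assuming the R-number field (for example ``AD``) has already been parsed into a list of numbers. The function names are hypothetical and not part of VAtools, and returning ``None`` for a zero sum is an assumption rather than the tool's documented behavior.

```python
def ref_ratio(values):
    """First value of an R-number field divided by the sum of all values."""
    total = sum(values)
    # Assumption: report None when there is no coverage at all.
    return values[0] / total if total else None


def alt_ratio(values):
    """Second value of an R-number field divided by the sum of all values."""
    total = sum(values)
    return values[1] / total if total else None


# Hypothetical AD value: 30 reference reads, 10 alternate reads.
ad = [30, 10]
print(ref_ratio(ad))  # 0.75
print(alt_ratio(ad))  # 0.25
```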
diff --git a/docs/vcf_expression_annotator.rst b/docs/vcf_expression_annotator.rst
@@ -1,35 +1,24 @@
 VCF Expression Annotator
 ========================
 
-The VCF Expression Annotator will take an output file from Cufflinks, Kallisto,
-or StringTie and add the data from that file to your VCF. The expression file type is
-specified using ``kallisto``, ``stringtie``, or ``cufflinks`` in the list of
-positional parameters.
-
-In addition, the type of expression data, either ``gene`` or ``transcript``, needs to
-be specified. This will result in the expression value to be written to the
-``GX`` or ``TX`` field, respectively.
-
-The input VCF needs to be annotated with VEP with gene and transcript information so
-that the VCF Expression Annotator can match a variant's Ensembl gene and transcript
-identifier in the VCF to the one in the expression file. When running in
-``gene`` mode, Ensembl IDs - not gene names - are used. Depending on the
-expression software used, the transcript identifiers might contain version
-numbers. To add transcript version numbers to your VEP annotation, use the
-``--transcript_version`` when running VEP. You can also use the
-``--ignore-ensembl-id-version`` flag of the VCF Expression Annotator to ignore
-the version of Ensembl gene and transcript IDs when finding the matching entry in your expression
-file.
-
-The VCF Expression Annotator also accepts a custom tab-delimited (TSV) file input for the
-expression file. This TSV file will need to contain one column with gene or
-transcrip Ensembl IDs and one column with the expression values. This file
-then needs to contain a header line that is used to
-identify the contents of each column. This is done via the  ``--id-column``
-and ``--expression-column`` parameters which need
-to match the gene/transcript identifier and expression value column headers.
-In order to use this option the expression file format option will need to be
-set to ``custom``. Please note that when running in ``gene`` mode, the ID
+The VCF Expression Annotator takes gene expression data and adds it to a VCF, allowing downstream tools to ask "how well is the gene containing this variant expressed?"
+
+The tool natively supports output files from `StringTie <https://github.com/gpertea/stringtie>`_, `Kallisto <https://kallisto.readthedocs.io/en/latest/>`_, or `Cufflinks <https://github.com/cole-trapnell-lab/cufflinks>`_, by specifying the appropriate format in the positional parameters: ``kallisto``, ``stringtie``, or ``cufflinks``.
+
+In addition, the type of expression data, either ``gene`` or ``transcript``, needs to be specified. This will result in the expression value being written to the ``GX`` or ``TX`` field, respectively.
+
+The input VCF needs to be annotated with VEP with gene and transcript information so that the VCF Expression Annotator can match a variant's Ensembl gene and transcript identifier in the VCF to the one in the expression file. 
+
+When running in ``gene`` mode, Ensembl IDs - not gene names - are used. Depending on the expression software used, the transcript identifiers might contain version numbers. To add transcript version numbers to your VEP annotation, use the ``--transcript_version`` option when running VEP.
+
+You can also use the ``--ignore-ensembl-id-version`` flag of the VCF Expression Annotator to ignore the version of Ensembl gene and transcript IDs when finding the matching entry in your expression file.
+
+Custom Expression Data
+----------------------
+
+VCF Expression Annotator can be used with other tools, so long as their expression output can be manipulated into a TSV containing two columns: Ensembl gene or transcript ID and expression values. This file also needs to contain a header line that is used to identify the contents of each column. These headers are specified via the ``--id-column`` and ``--expression-column`` parameters.
+
+To use this option, ``custom`` should be given as the file format parameter. Please note that when running in ``gene`` mode, the ID
 column will need to contain Ensembl Gene IDs, not gene names.
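
As a rough sketch of the custom input described above, the following Python writes a two-column TSV with a header line. The column names ``gene_id`` and ``tpm``, the function name, and the expression values are all hypothetical; any header names should work as long as they match what is passed to ``--id-column`` and ``--expression-column``.

```python
import csv
import io


def write_custom_tsv(expression, handle, id_column="gene_id", expression_column="tpm"):
    """Write {Ensembl ID: expression value} as a two-column TSV with a header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([id_column, expression_column])
    for ensembl_id, value in expression.items():
        writer.writerow([ensembl_id, value])


# Made-up expression values keyed by Ensembl gene ID (gene mode needs IDs, not names).
buffer = io.StringIO()
write_custom_tsv({"ENSG00000141510": 12.5, "ENSG00000157764": 3.0}, buffer)
print(buffer.getvalue())
```

A file written this way would be supplied with ``custom`` as the format, together with ``--id-column gene_id --expression-column tpm``.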
 
 By default the output VCF will be written to a ``.tx.vcf`` or ``.gx.vcf`` file next to

diff --git a/docs/vcf_genotype_annotator.rst b/docs/vcf_genotype_annotator.rst
@@ -5,7 +5,7 @@ The VCF Genotype Annotator will take an existing VCF file and add a new sample
 to it. The sample name is set via the second positional argument.
 The sample's GT field is pre-populated with a default value given by
 the third positional argument. Options are ``0/1``, ``1/1``, ``0/0``, or
-``.``.
+``.``. This is useful because some tools do not generate sample/genotype fields, but some downstream tools require them.
 
 It can also be used to add a GT field to an existing sample, e.g. for VCFs
 created by Strelka which does not output a GT field for its calls.
@@ -18,3 +18,10 @@ Usage
 -----
 
 .. program-output:: vcf-genotype-annotator -h
+
+Example Command
+---------------
+
+.. code-block:: none
+
+   vcf-genotype-annotator input.vcf TUMOR 0/1 -o output.vcf
diff --git a/docs/vcf_info_annotator.rst b/docs/vcf_info_annotator.rst
@@ -1,30 +1,105 @@
 VCF Info Annotator
 ==================
 
-The VCF Info Annotator will add data form a tab-delimited (TSV) file to a
-VCF's INFO column.
+The VCF Info Annotator adds data from a tab-delimited (TSV) file to a
+VCF's INFO column. It supports annotating one or more INFO fields in a
+single run, and can handle any VCF-spec data type.
 
-The TSV file needs to contain three columns in the following order:
+Input TSV format
+----------------
 
-- chromosome
-- position
-- the value for your field at that position
+The TSV file must have a header row. The first two columns must be
+chromosome and position (one-based coordinates, used to match rows to VCF records). Any
+additional columns can be mapped to VCF INFO fields by name.
 
-To define the new INFO field you need to specify a info field name in the
-positional parameters. This term will be used as the ID field in the INFO
-header. You will also need to specify a description in quotes that will be
-used as the Description field in the INFO header. Lastly, you will need to
-specify the format of your data. This can be either be ``Integer``, ``Float``,
-``Flag``, ``Character``, or ``String``.
+Example TSV with two data columns::
 
-Optional, you can also set the Source and Version fields of the INFO header
-using the ``--source`` and ``--version`` parameters, respectively.
+   chrom   pos         freq    classification
+   chr1    168192360   0.042   benign
+   chr1    230456789   0.187   pathogenic
 
-By default the output VCF will be written to a ``.info.vcf`` file next to
-your input VCF file. You can set a different output file using the
-``--output-vcf`` parameter.
+Gzip-compressed TSV files (``.tsv.gz``) are also accepted.
+
+Defining column mappings
+------------------------
+
+Use the ``-m`` / ``--column-mappings`` flag to specify how TSV columns
+map to VCF INFO fields. Each mapping is a colon-delimited string with
+four required fields and two optional fields:
+
+.. code-block:: none
+
+   source_col:info_field:type:description[:source[:version]]
+
+- **source_col** — the column name in the TSV header
+- **info_field** — the ID to use for the INFO field in the VCF
+- **type** — the VCF data type: ``Integer``, ``Float``, ``Flag``, ``Character``, or ``String``
+- **description** — free-text description written to the INFO header line
+- **source** *(optional)* — the source database or tool name
+- **version** *(optional)* — the source version (requires source to be set)
+
+To annotate multiple INFO fields in one run, separate mappings with a
+comma:
+
+.. code-block:: none
+
+   -m "col1:FIELD1:type:description,col2:FIELD2:type:description"
+
+Overwriting existing fields
+---------------------------
+
+By default, the tool raises an error if the VCF already contains an
+INFO field with the same ID as a mapped field. Use ``--overwrite``
+(``-w``) to allow writing to existing fields.
+
+``--clear-existing`` extends this behavior: when set, the existing value
+is removed from **every** record before annotation, so records that have
+no matching TSV entry will have no value for that field rather than
+retaining the old one. ``--clear-existing`` requires ``--overwrite``.
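
The interaction between ``--overwrite`` and ``--clear-existing`` can be illustrated with a small Python model. This is only a sketch of the semantics described above, not the tool's implementation: records are modelled as plain dictionaries keyed by ``(chrom, pos)``, and the function name ``annotate`` and its parameters are hypothetical.

```python
def annotate(header_fields, records, tsv_values, field,
             overwrite=False, clear_existing=False):
    """Model of INFO annotation; updates the INFO dicts in ``records`` in place."""
    if clear_existing and not overwrite:
        raise ValueError("--clear-existing requires --overwrite")
    if field in header_fields and not overwrite:
        raise ValueError(f"INFO field {field} already exists; use --overwrite")
    for key, info in records.items():
        if clear_existing:
            # Remove the old value from every record, matched or not.
            info.pop(field, None)
        if key in tsv_values:
            info[field] = tsv_values[key]
    return records


records = {("chr1", 100): {"MQ0": 5}, ("chr1", 200): {"MQ0": 7}}
tsv_values = {("chr1", 100): 0}  # only the first record has a TSV entry

result = annotate({"MQ0"}, records, tsv_values, "MQ0",
                  overwrite=True, clear_existing=True)
print(result)  # {('chr1', 100): {'MQ0': 0}, ('chr1', 200): {}}
```

Without ``clear_existing=True``, the second record would keep its old value of ``7``.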
+
+Output
+------
+
+By default the output VCF is written to a ``.info.vcf`` file next to
+your input VCF. Use ``--output-vcf`` to specify a different path.
 
 Usage
 -----
 
 .. program-output:: vcf-info-annotator -h
+
+Example Commands
+----------------
+
+Annotate a single Float field:
+
+.. code-block:: none
+
+   vcf-info-annotator input.vcf annotations.tsv \
+     -m "freq:FREQ:Float:Population allele frequency" \
+     -o output.vcf
+
+Annotate two fields in one run:
+
+.. code-block:: none
+
+   vcf-info-annotator input.vcf annotations.tsv \
+     -m "freq:FREQ:Float:Allele frequency,classification:CVCLASS:String:ClinVar classification" \
+     -o output.vcf
+
+Include source and version in the INFO header:
+
+.. code-block:: none
+
+   vcf-info-annotator input.vcf annotations.tsv \
+     -m "freq:FREQ:Float:Population frequency:dbSNP:156" \
+     -o output.vcf
+
+Overwrite an existing field, clearing it from records not in the TSV:
+
+.. code-block:: none
+
+   vcf-info-annotator input.vcf annotations.tsv \
+     -m "score:MQ0:Integer:Mapping quality score" \
+     -w --clear-existing \
+     -o output.vcf
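
To make the ``--column-mappings`` format concrete, here is a minimal Python sketch that splits a mapping string into its fields. It is an illustration of the documented ``source_col:info_field:type:description[:source[:version]]`` layout, not the tool's own parser. The names ``Mapping`` and ``parse_mappings`` are hypothetical, and the sketch assumes descriptions contain no commas or colons.

```python
from typing import NamedTuple, Optional

VALID_TYPES = {"Integer", "Float", "Flag", "Character", "String"}


class Mapping(NamedTuple):
    source_col: str
    info_field: str
    vcf_type: str
    description: str
    source: Optional[str] = None
    version: Optional[str] = None


def parse_mappings(spec):
    """Split a comma-separated list of colon-delimited mappings."""
    mappings = []
    for item in spec.split(","):
        parts = item.split(":")
        if not 4 <= len(parts) <= 6:
            raise ValueError(f"expected 4 to 6 colon-delimited fields: {item!r}")
        if parts[2] not in VALID_TYPES:
            raise ValueError(f"unsupported VCF type: {parts[2]!r}")
        mappings.append(Mapping(*parts))
    return mappings


for mapping in parse_mappings(
    "freq:FREQ:Float:Population frequency:dbSNP:156,score:MQ0:Integer:Mapping quality score"
):
    print(mapping)
```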