25 changes: 14 additions & 11 deletions docs/index.rst
Expand Up @@ -10,31 +10,34 @@ VAtools is a python package that includes several tools to
annotate VCF files with data from other tools.

**vcf-readcount-annotator**
A tool that will add the data from bam-readcount files to the VCF sample
column.
A tool that will add the data from `bam-readcount <https://github.com/genome/bam-readcount>`_ files to the VCF sample
column. Writes depth, allele counts, and VAFs; optionally also writes
detailed per-read quality metrics (mapping quality, base quality, strand
counts, and more) as additional FORMAT fields.

**vcf-expression-annotator**
A tool that will add the data from several expression tools' output files
to the VCF INFO column. Supported tools are StringTie, Kallisto,
and Cufflinks. There also is a ``custom`` option to annotate with data
A tool that will add gene expression data to the VCF INFO column.
Directly supports outputs from `StringTie <https://github.com/gpertea/stringtie>`_, `Kallisto <https://kallisto.readthedocs.io/en/latest/>`_, and `Cufflinks <https://github.com/cole-trapnell-lab/cufflinks>`_.
There also is a ``custom`` option to allow annotation with data
from any tab-delimited file.

**vcf-info-annotator**
A tool that will add data from a tab-delimited file to any user-specified
field in the VCF INFO column.
A general-purpose tool that will add data from a tab-delimited file into VCF INFO fields.
Supports mapping multiple TSV columns to multiple INFO fields in a single
pass.

**vcf-genotype-annotator**
A tool to add a new sample to an existing VCF file or fill in the GT field
for an existing sample in a VCF.
for an existing sample in a VCF. This is useful for VCFs that lack sample or
genotype fields, since their absence can cause some downstream tools to fail.

**vep-annotation-reporter**
A tool to create a tab-delimited (TSV) file of variants in a VCF and their
VEP annotations.
A tool to parse the complex `VEP <https://www.ensembl.org/vep>`_-added CSQ field from a VCF and create a tab-delimited (TSV) file of variants and their VEP annotations.

**ref-transcript-mismatch-reporter**
A tool to identify variants in a VCF where the reference genome used to
align and call variants doesn't match the Ensembl reference transcript
used by VEP for variant consequence annotations.
used by VEP for variant consequence annotations.

**transform-split-values**
A tool that extracts and manipulates values from existing sample fields
54 changes: 27 additions & 27 deletions docs/install.rst
Expand Up @@ -3,49 +3,30 @@ Install

The VAtools suite is written for Linux and Mac OS X.
If you are using Windows you will need to set up a
Linux environment, for example by setting up a virtual machine.
Linux environment, for example by using WSL or setting up a virtual machine.

VAtools requires Python 3.8 or above. Before running any
installation steps, check the Python version installed on your system:
VAtools requires Python 3.8 or above. Check the Python version installed on your system with:

.. code-block:: none

python -V
python --version

If you don't have Python 3 installed, we recommend using `Conda
<http://conda.pydata.org/docs/py2or3.html>`_ to emulate a Python 3.
environment. We've encountered problems with users that already have Python
2.x installed when they also try to install Python 3. The defaults will
not be set correctly in that case. If you already have Python 2.x installed
we **strongly** recommmend using Conda instead of installing Python 3
locally.
pip
---

Once you have set up your Python 3 environment correctly you can use
``pip`` to install VAtools. Make sure you have ``pip``
installed. ``pip`` is generally included in python distributions, but may
need to be upgraded before use. See the `instructions
<https://packaging.python.org/en/latest/installing/#install-pip-setuptools-and-wheel>`_
for installing or upgrading ``pip``.

After you have pip installed, type the following command on your Terminal:
Install VAtools using ``pip``:

.. code-block:: none

pip install vatools

You can check that the ``vatools`` package has been installed
under the default environment by running this command:
You can verify the installation with:

.. code-block:: none

pip show vatools

``pip`` will fetch and install VAtools and its dependencies for you.
After installing, each tool of the VAtools package is available in
its own command line tree directly from the Terminal.

If you have an old version of the vatools package installed you might
want to consider upgrading to the latest version:
To upgrade an existing installation:

.. code-block:: none

Expand All @@ -56,3 +37,22 @@ Docker

A Docker container for VAtools is available on DockerHub using the
`griffithlab/vatools <https://hub.docker.com/r/griffithlab/vatools/>`_ repo.

.. code-block:: none

docker pull griffithlab/vatools

Run any tool inside the container by passing it as the command. For example:

.. code-block:: none

docker run griffithlab/vatools vcf-readcount-annotator --help

To annotate a VCF with files on your local filesystem, mount the directory
containing your data as a volume:

.. code-block:: none

docker run -v /path/to/data:/data griffithlab/vatools \
vcf-readcount-annotator /data/input.vcf /data/readcounts.tsv DNA \
-o /data/output.vcf
24 changes: 9 additions & 15 deletions docs/ref_transcript_mismatch_reporter.rst
Expand Up @@ -3,35 +3,29 @@ Reference Transcript Mismatch Reporter

This tool can be used to identify variants where the reference genome build
doesn't match the Ensembl reference transcript used by VEP for variant consequence
annotations. In these cases, the REF nucleotide(s) at a variant position will differ
annotations. This is problematic because, at these positions, the REF nucleotide(s) will differ
from the Ensembl transcript nucleotide(s) at the corresponding mutation
position. Any resulting amino acid change predictions found in the ``Amino_acids``
field of the VEP CSQ annotation will then be different from the translated Ensembl transcript
position. The end result is that any amino acid change predictions found in the ``Amino_acids``
field of the VEP CSQ annotation will be different from the translated Ensembl transcript
amino acids at that position.

This will lead to errors in some downstream tools, e.g. pVACseq, which rely on
the ``Amino_acids`` field as well as the translated Ensembl transcript peptide
sequence - as reported by the ``Wildtype`` plugin to make predictions about
the impact of the mutation on the transcript peptide sequence. Since those two
fields will be in disagreement in such cases as described above, pVACseq
This will lead to errors in some downstream tools (e.g. pVACseq), which rely on accurate peptide sequences to make predictions about neoantigen MHC binding. When mismatches occur, the ``Amino_acids`` field and the transcript peptide sequence reported by the ``Wildtype`` plugin
will be in disagreement and pVACseq
cannot make predictions on such variants.
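
The disagreement described above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function name ``csq_matches_wildtype`` and the short peptide strings are hypothetical, and the sketch assumes that the reference residue(s) appear before the ``/`` in ``Amino_acids`` and that protein positions are one-based.

.. code-block:: python

    def csq_matches_wildtype(amino_acids, protein_position, wildtype_protein):
        """Check that the reference residue(s) agree with the wildtype peptide.

        amino_acids: value of the VEP ``Amino_acids`` field, e.g. "R/H"
        protein_position: one-based start position of the change (assumption)
        wildtype_protein: peptide sequence as reported by the Wildtype plugin
        """
        ref_aa = amino_acids.split("/")[0]
        if ref_aa == "-":
            # Pure insertion: there is no reference residue to compare.
            return True
        start = protein_position - 1
        return wildtype_protein[start:start + len(ref_aa)] == ref_aa


    # Hypothetical five-residue peptides, for illustration only.
    consistent = csq_matches_wildtype("R/H", 3, "MKRLV")
    mismatched = csq_matches_wildtype("R/H", 3, "MKQLV")
    print(consistent)  # True
    print(mismatched)  # False

A variant for which such a check fails is the kind of record this tool is meant to flag.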

Such errors might occur in a small number of variants if there are only minor
differences between the reference used and the Ensembl transcripts but they
might also be more widespread, for example, if users aligned to GRCh37 but
used a GRCh38 VEP cache.
If there are only minor differences between the reference used and the Ensembl transcripts, only a small number of variants may be affected. If a large number of variants are flagged, consider whether there is a fundamental incompatibility (for example, reads aligned to GRCh37 but annotated with a GRCh38 VEP cache).

The input VCF needs to be annotated by VEP and requires annotation with the
``Wildtype`` VEP plugin available as part of `pVACtools <https://pvactools.readthedocs.io/en/latest/pvacseq/input_file_prep/vep.html#installing-vep>`_.

This tool will report on the number of variants and transcripts in a VCF that
are affected by this issuei and output this information to stdout. It will
are affected by this issue and output this information to stdout. It will
write a ``.mismatch.tsv`` file next to the VCF that provides further details
on the problematic variants.

This tool also allows the user to either soft-filter or hard-filter the VCF
using the ``--filter [soft|hard]`` parameter. Soft-filtering will tag the
problematic variants with a custom VCF FILTER ``CSQ_MISMACH`` while hard-filtering
problematic variants with a custom VCF FILTER ``CSQ_MISMATCH`` while hard-filtering
will produce a new VCF that has these variants removed. When using a filter,
the output VCF will be written to a ``filtered.vcf`` file next to
your input VCF file. You can set a different output file using the
Expand All @@ -47,4 +41,4 @@ Example Command

.. code-block:: none

ref-transcript-mismatch-report input.vcf --filter soft
ref-transcript-mismatch-reporter input.vcf --filter soft
13 changes: 10 additions & 3 deletions docs/transform_split_values.rst
Expand Up @@ -18,9 +18,9 @@ Supported operations are the following:
- ``ref_ratio``: The first value in a R-number field divided by the sum of all the numbers (the reference ratio).
- ``alt_ratio``: The second value in a R-number field divided by the sum of all the numbers (the alt ratio).
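
As a worked illustration of the two ratio operations, the following Python sketch applies the definitions above to a list of numbers such as the values of an ``AD`` field. The function names mirror the operation names, but the code is only a sketch, not the tool's implementation. Returning ``None`` when the values sum to zero is an assumption made here to avoid a division by zero.

.. code-block:: python

    def ref_ratio(values):
        """First value divided by the sum of all values."""
        total = sum(values)
        # Assumption: report no ratio when there is no depth at all.
        return values[0] / total if total else None


    def alt_ratio(values):
        """Second value divided by the sum of all values."""
        total = sum(values)
        return values[1] / total if total else None


    allele_depths = [30, 10]  # hypothetical reference and alt read counts
    print(ref_ratio(allele_depths))  # 0.75
    print(alt_ratio(allele_depths))  # 0.25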

If your VCF is a multi-sample VCF, you have to pick one of the sample in
your VCF by setting the ``--sample-name`` option. This is the sample that the
readcounts will be written for.
If your VCF is a multi-sample VCF, you have to pick one of the samples in
your VCF by setting the ``--sample-name`` option. This is the sample whose
field values will be extracted.

By default the output TSV will be written to a ``.tsv`` file next to
your input VCF file. You can set a different output file using the
Expand All @@ -30,3 +30,10 @@ Usage
-----

.. program-output:: transform-split-values -h

Example Command
---------------

.. code-block:: none

transform-split-values input.vcf AD alt -o output.tsv
47 changes: 18 additions & 29 deletions docs/vcf_expression_annotator.rst
@@ -1,35 +1,24 @@
VCF Expression Annotator
========================

The VCF Expression Annotator will take an output file from Cufflinks, Kallisto,
or StringTie and add the data from that file to your VCF. The expression file type is
specified using ``kallisto``, ``stringtie``, or ``cufflinks`` in the list of
positional parameters.

In addition, the type of expression data, either ``gene`` or ``transcript``, needs to
be specified. This will result in the expression value to be written to the
``GX`` or ``TX`` field, respectively.

The input VCF needs to be annotated with VEP with gene and transcript information so
that the VCF Expression Annotator can match a variant's Ensembl gene and transcript
identifier in the VCF to the one in the expression file. When running in
``gene`` mode, Ensembl IDs - not gene names - are used. Depending on the
expression software used, the transcript identifiers might contain version
numbers. To add transcript version numbers to your VEP annotation, use the
``--transcript_version`` when running VEP. You can also use the
``--ignore-ensembl-id-version`` flag of the VCF Expression Annotator to ignore
the version of Ensembl gene and transcript IDs when finding the matching entry in your expression
file.

The VCF Expression Annotator also accepts a custom tab-delimited (TSV) file input for the
expression file. This TSV file will need to contain one column with gene or
transcrip Ensembl IDs and one column with the expression values. This file
then needs to contain a header line that is used to
identify the contents of each column. This is done via the ``--id-column``
and ``--expression-column`` parameters which need
to match the gene/transcript identifier and expression value column headers.
In order to use this option the expression file format option will need to be
set to ``custom``. Please note that when running in ``gene`` mode, the ID
The VCF Expression Annotator takes gene expression data and adds it to a VCF, allowing downstream tools to ask "how well is the gene containing this variant expressed?"


The tool natively supports output files from `StringTie <https://github.com/gpertea/stringtie>`_, `Kallisto <https://kallisto.readthedocs.io/en/latest/>`_, and `Cufflinks <https://github.com/cole-trapnell-lab/cufflinks>`_. Specify the matching format in the positional parameters: ``kallisto``, ``stringtie``, or ``cufflinks``.

In addition, the type of expression data, either ``gene`` or ``transcript``, needs to be specified. This will result in the expression value being written to the ``GX`` or ``TX`` field, respectively.

The input VCF needs to be annotated with VEP with gene and transcript information so that the VCF Expression Annotator can match a variant's Ensembl gene and transcript identifier in the VCF to the one in the expression file.

When running in ``gene`` mode, Ensembl IDs - not gene names - are used. Depending on the expression software used, the transcript identifiers might contain version numbers. To add transcript version numbers to your VEP annotation, use the ``--transcript_version`` option when running VEP.

You can also use the ``--ignore-ensembl-id-version`` flag of the VCF Expression Annotator to ignore the version of Ensembl gene and transcript IDs when finding the matching entry in your expression file.
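
To make version-insensitive matching concrete, here is a small Python sketch. It assumes that the version is the numeric suffix after the final dot of an Ensembl ID. The helper ``strip_version`` and the example ID and value are hypothetical, and this is not how the tool is necessarily implemented.

.. code-block:: python

    def strip_version(ensembl_id):
        """Drop a trailing ".N" version suffix from an Ensembl ID, if present."""
        return ensembl_id.split(".", 1)[0]


    # Hypothetical expression values keyed by versioned transcript IDs.
    expression = {"ENST00000367770.8": 12.5}

    # Index the expression values by unversioned ID so that either form matches.
    unversioned = {strip_version(key): value for key, value in expression.items()}

    matched = unversioned.get(strip_version("ENST00000367770"))
    print(matched)  # 12.5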

Custom Expression Data
----------------------

VCF Expression Annotator can be used with other tools, so long as their expression output can be manipulated into a TSV containing two columns: Ensembl gene or transcript ID and expression values. This file also needs to contain a header line that is used to identify the contents of each column. These headers are specified via the ``--id-column`` and ``--expression-column`` parameters.

In order to use this option, set the file format parameter to ``custom``. Please note that when running in ``gene`` mode, the ID
column will need to contain Ensembl Gene IDs, not gene names.
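
A file in this shape can be produced with a few lines of Python. The sketch below is only an illustration: the helper ``write_expression_tsv``, the header names ``gene_id`` and ``tpm``, and the expression value are all hypothetical choices, not names required by the tool.

.. code-block:: python

    import csv
    import io


    def write_expression_tsv(values, handle, id_column="gene_id",
                             expression_column="tpm"):
        """Write a two-column TSV with a header line.

        values maps Ensembl gene or transcript IDs to expression values.
        The header names are arbitrary; they only need to be passed back to
        the annotator through --id-column and --expression-column.
        """
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([id_column, expression_column])
        for ensembl_id, value in values.items():
            writer.writerow([ensembl_id, value])


    # Write to an in-memory buffer here; a real run would use an open file.
    buffer = io.StringIO()
    write_expression_tsv({"ENSG00000141510": 42.7}, buffer)
    tsv_text = buffer.getvalue()
    print(tsv_text)

With these header names, ``--id-column gene_id`` and ``--expression-column tpm`` would identify the two columns.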

By default the output VCF will be written to a ``.tx.vcf`` or ``.gx.vcf`` file next to
9 changes: 8 additions & 1 deletion docs/vcf_genotype_annotator.rst
Expand Up @@ -5,7 +5,7 @@ The VCF Genotype Annotator will take an existing VCF file and add a new sample
to it. The sample name is set via the second positional argument.
The sample's GT field is pre-populated with a default value given by
the third positional argument. Options are ``0/1``, ``1/1``, ``0/0``, or
``.``.
``.``. This is useful because some tools do not generate sample/genotype fields, and some downstream tools require them.

It can also be used to add a GT field to an existing sample, e.g. for VCFs
created by Strelka which does not output a GT field for its calls.
Expand All @@ -18,3 +18,10 @@ Usage
-----

.. program-output:: vcf-genotype-annotator -h

Example Command
---------------

.. code-block:: none

vcf-genotype-annotator input.vcf TUMOR 0/1 -o output.vcf
109 changes: 92 additions & 17 deletions docs/vcf_info_annotator.rst
@@ -1,30 +1,105 @@
VCF Info Annotator
==================

The VCF Info Annotator will add data form a tab-delimited (TSV) file to a
VCF's INFO column.
The VCF Info Annotator adds data from a tab-delimited (TSV) file to a
VCF's INFO column. It supports annotating one or more INFO fields in a
single run, and can handle any VCF-spec data type.

The TSV file needs to contain three columns in the following order:
Input TSV format
----------------

- chromosome
- position
- the value for your field at that position
The TSV file must have a header row. The first two columns must be
chromosome and position (one-based coordinates, used to match rows to VCF records). Any
additional columns can be mapped to VCF INFO fields by name.

To define the new INFO field you need to specify a info field name in the
positional parameters. This term will be used as the ID field in the INFO
header. You will also need to specify a description in quotes that will be
used as the Description field in the INFO header. Lastly, you will need to
specify the format of your data. This can be either be ``Integer``, ``Float``,
``Flag``, ``Character``, or ``String``.
Example TSV with two data columns::

Optional, you can also set the Source and Version fields of the INFO header
using the ``--source`` and ``--version`` parameters, respectively.
chrom pos freq classification
chr1 168192360 0.042 benign
chr1 230456789 0.187 pathogenic

By default the output VCF will be written to a ``.info.vcf`` file next to
your input VCF file. You can set a different output file using the
``--output-vcf`` parameter.
Gzip-compressed TSV files (``.tsv.gz``) are also accepted.
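
The layout described in this section can be read with Python's standard library. The following sketch is not the tool's own parser. The function ``load_annotations`` is hypothetical and simply assumes that the first two header columns are chromosome and position and that a ``.gz`` suffix means gzip compression.

.. code-block:: python

    import csv
    import gzip
    import os
    import tempfile


    def load_annotations(path):
        """Index TSV rows by (chromosome, position).

        Assumes a header row whose first two columns are chromosome and
        one-based position; all other columns are kept as strings.
        """
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            chrom_col, pos_col = reader.fieldnames[:2]
            return {(row[chrom_col], int(row[pos_col])): row for row in reader}


    # Round-trip a small gzip-compressed example in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp_dir:
        tsv_path = os.path.join(tmp_dir, "annotations.tsv.gz")
        with gzip.open(tsv_path, "wt") as handle:
            handle.write("chrom\tpos\tfreq\tclassification\n")
            handle.write("chr1\t168192360\t0.042\tbenign\n")
        annotations = load_annotations(tsv_path)

    print(annotations[("chr1", 168192360)]["classification"])  # benign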

Defining column mappings
------------------------

Use the ``-m`` / ``--column-mappings`` flag to specify how TSV columns
map to VCF INFO fields. Each mapping is a colon-delimited string with
four required fields and two optional fields:

.. code-block:: none

source_col:info_field:type:description[:source[:version]]

- **source_col** — the column name in the TSV header
- **info_field** — the ID to use for the INFO field in the VCF
- **type** — the VCF data type: ``Integer``, ``Float``, ``Flag``, ``Character``, or ``String``
- **description** — free-text description written to the INFO header line
- **source** *(optional)* — the source database or tool name
- **version** *(optional)* — the source version (requires source to be set)

To annotate multiple INFO fields in one run, separate mappings with a
comma:

.. code-block:: none

-m "col1:FIELD1:type:description,col2:FIELD2:type:description"
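
To show how such a mapping string breaks down into its parts, here is a minimal Python sketch. It is an assumption-laden illustration rather than the tool's parser: ``ColumnMapping`` and ``parse_mappings`` are hypothetical names, and splitting naively on ``,`` and ``:`` means this sketch cannot handle descriptions that contain either character.

.. code-block:: python

    from typing import NamedTuple, Optional

    VALID_TYPES = {"Integer", "Float", "Flag", "Character", "String"}


    class ColumnMapping(NamedTuple):
        source_col: str
        info_field: str
        data_type: str
        description: str
        source: Optional[str] = None
        version: Optional[str] = None


    def parse_mappings(spec):
        """Split a comma-separated list of colon-delimited mappings."""
        mappings = []
        for item in spec.split(","):
            parts = item.split(":")
            if not 4 <= len(parts) <= 6:
                raise ValueError(f"expected 4 to 6 fields, got {len(parts)}: {item!r}")
            if parts[2] not in VALID_TYPES:
                raise ValueError(f"unsupported type {parts[2]!r} in {item!r}")
            mappings.append(ColumnMapping(*parts))
        return mappings


    mappings = parse_mappings(
        "freq:FREQ:Float:Population frequency:dbSNP:156,"
        "classification:CVCLASS:String:ClinVar classification"
    )
    for mapping in mappings:
        print(mapping.source_col, "->", mapping.info_field, mapping.data_type)

Because ``version`` is the sixth positional field, it can only be given when ``source`` is also present, which matches the requirement stated in the list above.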

Overwriting existing fields
---------------------------

By default, the tool raises an error if the VCF already contains an
INFO field with the same ID as a mapped field. Use ``--overwrite``
(``-w``) to allow writing to existing fields.

``--clear-existing`` extends this behavior: when set, the existing value
is removed from **every** record before annotation, so records that have
no matching TSV entry will have no value for that field rather than
retaining the old one. ``--clear-existing`` requires ``--overwrite``.
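
The difference between the two flags is easiest to see on a toy model. In the Python sketch below, records are plain dictionaries and ``annotate`` is a hypothetical function that mimics the behavior described above. It does not read or write real VCF files and is not the tool's code.

.. code-block:: python

    import copy


    def annotate(records, annotations, field, header_fields,
                 overwrite=False, clear_existing=False):
        """Apply one INFO annotation to toy records, modelling the two flags."""
        if clear_existing and not overwrite:
            raise ValueError("--clear-existing requires --overwrite")
        if field in header_fields and not overwrite:
            raise ValueError(f"INFO field {field} already exists")
        for record in records:
            if clear_existing:
                # Remove the old value from every record before annotating.
                record["info"].pop(field, None)
            key = (record["chrom"], record["pos"])
            if key in annotations:
                record["info"][field] = annotations[key]
        return records


    records = [
        {"chrom": "chr1", "pos": 100, "info": {"MQ0": 1}},
        {"chrom": "chr1", "pos": 200, "info": {"MQ0": 2}},
    ]
    annotations = {("chr1", 100): 7}  # only the first record has a TSV entry

    kept = annotate(copy.deepcopy(records), annotations, "MQ0", {"MQ0"},
                    overwrite=True)
    cleared = annotate(copy.deepcopy(records), annotations, "MQ0", {"MQ0"},
                       overwrite=True, clear_existing=True)

    print(kept[1]["info"])     # {'MQ0': 2}
    print(cleared[1]["info"])  # {}

In both runs the first record ends up with the new value. Only the record without a TSV entry differs: it retains its old value with ``overwrite`` alone and loses it when ``clear_existing`` is also set.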

Output
------

By default the output VCF is written to a ``.info.vcf`` file next to
your input VCF. Use ``--output-vcf`` to specify a different path.

Usage
-----

.. program-output:: vcf-info-annotator -h

Example Commands
----------------

Annotate a single Float field:

.. code-block:: none

vcf-info-annotator input.vcf annotations.tsv \
-m "freq:FREQ:Float:Population allele frequency" \
-o output.vcf

Annotate two fields in one run:

.. code-block:: none

vcf-info-annotator input.vcf annotations.tsv \
-m "freq:FREQ:Float:Allele frequency,classification:CVCLASS:String:ClinVar classification" \
-o output.vcf

Include source and version in the INFO header:

.. code-block:: none

vcf-info-annotator input.vcf annotations.tsv \
-m "freq:FREQ:Float:Population frequency:dbSNP:156" \
-o output.vcf

Overwrite an existing field, clearing it from records not in the TSV:

.. code-block:: none

vcf-info-annotator input.vcf annotations.tsv \
-m "score:MQ0:Integer:Mapping quality score" \
-w --clear-existing \
-o output.vcf