IonTorrent variant reporting pipeline for clinical interpretation of cancer panel results

License: GNU General Public License v3.0

Topics: iontorrent, iontorrent-server, genes, igv, panel, pipeline, analysis, visualizations, reporting


reportIT

IonTorrent Variant Reporting Pipeline

This program will annotate, aggregate, and summarize clinical variant information from the IonTorrent suite

  • A brief graphical overview of the pipeline can be found here.

Overview

Sequencing results are accessed directly from the lab's IonTorrent server via scripts using ssh and rsync. After downloading files for a given run to the local system, VCF formatted variant call files are annotated and summarized to identify mutations of known significance (ANNOVAR, bcftools), while BAM formatted coverage files are visualized with IGV. An HTML formatted report is generated from variant summary information and IGV snapshots. Results can be easily emailed to clinicians for review with the provided script.
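For illustration, the download step amounts to an rsync over ssh; a minimal sketch follows (the server address and remote path here are placeholders, not the pipeline's actual values — the real scripts read them from data/server_info.txt and the global settings files):

import subprocess

# Placeholders; the pipeline reads these from data/server_info.txt and its
# global settings files.
server = "username@server_IP"
remote_run_dir = "/path/to/remote/run_id"  # hypothetical remote run directory
local_run_dir = "output/run_id"

# Mirror the run directory from the IonTorrent server to the local system.
subprocess.check_call([
    "rsync", "-avz", "-e", "ssh",
    "{0}:{1}/".format(server, remote_run_dir),
    local_run_dir,
])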

In Progress

Development of the following items is planned:

  • Per-sample reports showing variant summary table and clinical interpretation of variants supplied by the Weill Cornell Precision Medicine Knowledgebase.

  • Analysis review feedback system to mark sequencing artifacts and remove them from report output

  • Deposition of pipeline output in a central database (REDCap, or other)

Analysis Report Example

A full HTML version of the report can be previewed here or here.

An analysis overview report displays the significant variants found across all samples in the run. 'SC' sensitivity control samples are shown in a separate table (hidden by default).

[Screenshot: analysis overview report variant summary table]

IGV snapshots are shown for all significant variants. For low-frequency variants, a "long view" snapshot is included to ensure mutations can be seen among the reads. If available, the 'NC' control sample is included on the lower track.

[Screenshot: IGV snapshots with 'NC' control sample on the lower track]

Installation

  • First, clone this repo:
git clone --recursive https://github.com/stevekm/reportIT.git
cd reportIT
  • Then, run the dir_setup.sh script:
./dir_setup.sh

This should set up the bin and ref directories, and create and symlink the external input, output, and data directories. You should verify these symlinks and directories, then populate the data directory with the files required by the pipeline (see the 'Data directory' section, below). You should also set up automatic ssh login for your IonTorrent server as described here.

Usage

Check For Runs

Before you can run the pipeline, you need to know which runs are available on your IonTorrent server. The following steps require that your data/server_info.txt file is set correctly, as described below. It is also recommended to have ssh key authentication set up for your user account on the IonTorrent server.

Missing Runs Only

If you only want to know which runs on the IonTorrent server are not present on your local system, you can use this script:

code/check_for_new_runs.py

By default, it will validate each missing run to make sure that IonTorrent sequencing data has been produced in the remote run directory. It will also automatically create an unpaired samplesheet for the missing runs. If you include the -d argument, the script will also download the missing runs listed on the samplesheet it produces.
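The core of the "missing runs" check can be sketched as follows (illustrative code, not the script's own; it assumes the remote run list is already in hand — the real script queries the server over ssh — and a local output directory with one subdirectory per downloaded run):

import os

def find_missing_runs(remote_runs, local_output_dir="output"):
    """Return remote run IDs that have no local run directory yet."""
    local_runs = set(os.listdir(local_output_dir))
    return [run_id for run_id in remote_runs if run_id not in local_runs]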

All Available Runs

The following script will log into your IonTorrent server, and output a list of all run directories found:

code/get_server_run_list.sh

Make Samplesheet

The best way to run the reportIT pipeline is by using a samplesheet. These are stored in the samplesheets directory by default, and an example can be found here. A samplesheet must be in TSV (tab-separated) format, typically with one run ID per line. If two runs should be treated as a 'pair', both run ID's should be on the same line. Note that paired-run processing only affects report and IGV snapshot generation, not downloading or annotation.

The best way to create a samplesheet is to use the make_samplesheet.py script. This script can take any number of unpaired run ID's, and a single set of paired ID's.

code/make_samplesheet.py unpaired_run1 unpaired_run2 -p paired_run3.1 -p paired_run3.2

A samplesheet produced this way will look like this:

unpaired_run1
unpaired_run2
paired_run3.1	paired_run3.2
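Parsing this format is straightforward; a minimal sketch of a reader (illustrative code, not the pipeline's own):

def read_samplesheet(path):
    """Yield lists of run IDs; one-element lists are unpaired runs."""
    with open(path) as f:
        for line in f:
            fields = [x for x in line.strip().split("\t") if x]
            if fields:
                yield fields

# Example:
# for runs in read_samplesheet("samplesheets/samplesheet.tsv"):
#     print(runs)  # ['unpaired_run1'], ..., ['paired_run3.1', 'paired_run3.2']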

Run Pipeline

The simplest way to run the reportIT pipeline is to use the run_samplesheet.py script, and specify which actions you would like to take on the analysis ID's specified in your sample sheet. The following actions are available:

  • Download: -d
  • Annotate: -a
  • Report: -r

The following modifiers are available:

  • Submit to cluster with qsub: -q

Multiple actions can be combined in a single command:

# download all files, then annotate and report with qsub
$ code/run_samplesheet.py samplesheets/samplesheet.tsv -darq

NOTE: The -q method has been configured to work with the phoenix SGE compute cluster at NYULMC, and might need to be reconfigured to work on other HPC systems.

Mail Results

After manually reviewing the HTML report output, pipeline results can be delivered in a pre-formatted email using the following script:

code/mail_analysis_report.sh <analysis_ID>

Multiple analysis_ID's can be passed to email all results sequentially. All summary tables, VCF files, IGV snapshots, and the analysis overview report will be included as email attachments. Configuration for emailing is saved in the file mail_settings.sh.

Usage Notes

Rough estimates for pipeline completion time are ~5-10 minutes to download all files and annotate variants, and ~5-15 minutes to create all IGV snapshots and generate reports. In total this comes to roughly 10-30 minutes per analysis, depending on the number of variants present.

Running the pipeline without the -q argument will run all pipeline steps for all analyses in the current session; if you plan to do this, you should probably run the pipeline in screen so it can continue in the background independent of your terminal connection. Note that running with qsub is currently disabled for the file download step, so all file downloads will always run in the current session. If you have a lot of analyses, this might take a while.

Program Validation

As a safety feature against undesired usage, the run_samplesheet.py script includes self-validating features to make sure the following items are set correctly before running the pipeline:

  • check that the proper git branch is currently in use
  • check that the proper output directory has been symlinked

These validations can be skipped by adding the --debug argument to the script.
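For reference, a minimal sketch of these two checks (the expected branch name and symlink target are placeholders; run_samplesheet.py implements its own versions):

import os
import subprocess

def validate_setup(expected_branch="master", link="output", expected_target=None):
    """Raise if the git branch or the output symlink is not as expected."""
    branch = subprocess.check_output(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"]).decode().strip()
    if branch != expected_branch:
        raise RuntimeError("on branch %r, expected %r" % (branch, expected_branch))
    if not os.path.islink(link):
        raise RuntimeError("%r is not a symlink" % link)
    if expected_target is not None and os.path.realpath(link) != expected_target:
        raise RuntimeError("%r does not point at %r" % (link, expected_target))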

Files & Directories

Input, output, and reference data for the program are stored outside the program's directory and are reached via symlinks. These should be created automatically when you run the dir_setup.sh script during initial installation.

Data directory

The data directory should contain the following items:

data/
|-- control_sample_IDs.txt
|-- control_sample_regex.txt
|-- email_recipients.txt
|-- actionable_genes.txt
|-- panel_genes.txt
|-- server_info.txt
`-- summary_fields.txt

Important files:

  • control_sample_IDs.txt: ID's for IonTorrent samples that denote 'control' samples, which should not be used for IGV snapshots. Example:
NC
SC
NTC
NC HAPMAP
HAPMAP
Sc
SC-ACROMETRIX
  • control_sample_regex.txt: Regex patterns, used with grep -f in some scripts, to identify NC control samples. Example:
^NC[[:space:]]
^NC[[:space:]]HAPMAP[[:space:]]
^HAPMAP[[:space:]]
  • SC_control_sample_IDs.txt: ID's to identify SC control samples. Example:
SC
Sc
SC-ACROMETRIX
  • email_recipients.txt: A list of email addresses to use in the To: field of the outgoing email with analysis results. The addresses must all be on a single line.
  • panel_genes.txt: A list of genes to be included in the gene panel. Example:
AKT1
ALK
APC
ATM
  • actionable_genes.txt: A list of genes determined to be actionable. Example:
BRAF
EGFR
FLT3
  • server_info.txt: The login info for the IonTorrent server. Must be formatted as such:
username@server_IP

References Directory

This directory contains the following items:

ref
|-- cannonical_transcript_table.py
`-- hg19
    |-- canonical_transcript_list.txt
    |-- download_refs.txt
    |-- kgXref.txt
    |-- kgXref.txt.gz
    |-- knownCanonical.txt
    `-- knownCanonical.txt.gz

Important files:

  • hg19/canonical_transcript_list.txt: A list of canonical transcripts to use with the given genome (hg19). Each gene should have only one 'canonical transcript' in the list. Example:
NR_026820
NM_001005484
NR_039983
NM_001005277
  • IDs_to_replace.csv: Transcript ID's that should be replaced in the default canonical_transcript_list.txt list during setup, in the format of OLD,NEW. Example:
NM_001276760,NM_000546

Pipeline Settings

Settings used by the pipeline have been saved in several files, for ease of access & modification.

  • filter_criteria.json: Filtering criteria used to identify quality variants for inclusion in the variant summary tables. The sample file example_filter_criteria.json has been provided, and should be renamed to filter_criteria.json.

  • global_settings.sh: Global pipeline locations and settings used by the bash scripts. References to hard-coded locations on your local system or IonTorrent system should be reviewed and updated to match your environment.

  • global_settings.py: Many of the same settings as set in global_settings.sh, intended for use in Python scripts.

  • mail_settings.sh: Settings to use with the bash email script(s).

Control Samples

An important aspect of the IonTorrent reporting pipeline is the ability to recognize control samples included in a run. Unlike patient samples, these samples are included in a run for quality control purposes. Since the IonTorrent system is agnostic to the nature of samples in a run, these control samples must be denoted by their sample ID entered in the system during run setup. Similarly, the reporting pipeline is only able to identify which samples are controls by their sample ID. This makes sample labeling of control samples important during wet-lab IonTorrent run setup. The following control samples are typically used:

  • SC: Sensitivity control (positive control). This sample is expected to show a large number of mutations. Typically uses AcroMetrix Hotspot control sample.

  • NC: Negative control. DNA sample that should not have any mutations. Typically a HapMap sample.

  • NTC: No template control. No DNA included in the sample, only water.

The best practice is to label these control samples as SC, NC, and NTC in every run. The ID's to be used should be entered in the appropriate settings files as described in the section Files & Directories.

These control samples receive special treatment when running the pipeline. During processing, IGV snapshots will not be taken for any sample that has a label matching a control sample. Instead, the pipeline will attempt to identify the NC control sample for a run, or pair of runs, and use its corresponding .bam file as the lower track in IGV snapshots for all samples in the given run(s). During report generation, variants from the SC sample will be excluded from the primary variant summary table displayed, since it is expected to have a large number of variants.
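The ID-based lookup this implies is simple; a minimal sketch, assuming the ID lists are read from the data/ files described above:

def load_ids(path):
    """Read one ID per line, skipping blanks."""
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

control_ids = load_ids("data/control_sample_IDs.txt")

def is_control(sample_id):
    """True if the sample ID matches a known control label (e.g. SC, NC, NTC)."""
    return sample_id in control_ids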

Software Requirements

This program has been developed in a Linux environment running CentOS 6. Some scripts issue system commands which rely on standard GNU Linux utilities. The current list of all binary dependencies is contained in the file bin.txt. Some download notes for obtaining these programs can be found in bin_downloads.txt.

- Python 2.7
- pandoc version 1.12.3 or higher
- R 3.3.0 or higher
- IGV_2.3.81
- bcftools-1.3.1
- htslib-1.3.1
- samtools-1.3.1
- ANNOVAR version 2015-06-17 21:43:53 -0700 (Wed, 17 Jun 2015)
- GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)

Python packages

- numpy==1.11.0
- pandas==0.17.1

R packages

- rmarkdown
- optparse


reportit's Issues

WC PMKB Excel sheet variant parsing needed

In order to integrate clinical interpretations of variants from the WC PMKB database (here: https://pmkb.weill.cornell.edu/therapies/download.xlsx , https://pmkb.weill.cornell.edu/), the provided Excel sheet needs to be parsed to make searching for variants easier. Current scripts only provide TSV output format; the scripts need to be extended to split the dataset so each individual variant has a single row, and the variant ID's match ANNOVAR output format.
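A minimal sketch of the one-variant-per-row reshaping, assuming a hypothetical 'Variant(s)' column holding comma-separated variants (the real sheet's column names and file paths may differ):

import pandas as pd

df = pd.read_excel("therapies.xlsx")  # hypothetical local copy of the sheet
rows = []
for _, row in df.iterrows():
    # Split the multi-variant field so each variant gets its own row.
    for variant in str(row["Variant(s)"]).split(","):
        new_row = row.copy()
        new_row["Variant(s)"] = variant.strip()
        rows.append(new_row)
flat = pd.DataFrame(rows)
flat.to_csv("therapies_per_variant.tsv", sep="\t", index=False)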

Save stdout and stderr logs from active session script usage

One of the advantages of submitting jobs to the HPC cluster with qsub is that the stdout and stderr get logged; this is not the case when running the scripts in the active session. To get around this, need to look into a method using tee or similar to capture all stdout and stderr printed to the console when running the pipeline scripts 'interactively', so that these logs can be created.
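A minimal sketch of tee-style capture from a wrapper (the function and paths are illustrative, not existing pipeline code): run the child process, and echo its combined stdout/stderr to both the console and a log file.

import subprocess
import sys

def run_and_log(cmd, log_path):
    """Run cmd, echoing combined stdout/stderr to the console and a log file."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    with open(log_path, "w") as log:
        for line in iter(proc.stdout.readline, b""):
            text = line.decode()
            sys.stdout.write(text)
            log.write(text)
    return proc.wait()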

IGV screenshot review feedback integration

A method needs to be developed to allow for manual review of IGV screenshot results by a licensed clinician, followed by integration of the clinician's feedback on which variants to keep or remove.

Need full analysis report

Need to develop a report that shows results for all samples in an analysis. Include all IGV screenshots; don't include clinical interpretations.

need directory setup script

Since a lot of the required reference files and directories for the pipeline are stored external to the pipeline repo itself, a script needs to be created which will set up all of the necessary directories.

Variant Summary table needs to be reordered

Currently, no particular ordering is applied to the variant summary table which is displayed in the report. It has been requested that this should be ordered by sample Barcode value.

Wrong NC control sample being used for IGV snapshots

In one case so far, an unpaired analysis found an NC control sample BAM file from a completely different analysis directory and used it for the IGV snapshots. Need to verify the NC control sample discovery steps and figure out why this happened. The analysis in question used qsub_report_wrapper.sh ....16-027.... and found the NC BAM in analysis 16-55-1, sample 8, instead.

run_samplesheet with qsub script does not wait for annotations to finish

When the run_samplesheet.py script is used with qsub to submit pipeline jobs to the HPC cluster, it does not wait for all annotation jobs to complete before submitting IGV reporter jobs. This will result in failed IGV report jobs. Not sure there is an easy fix for this that will not break other parts of the pipeline; it might just have to be run again after all annotation jobs have finished. The current advice is to run the script twice: once for annotations only, then again for reporting only.

missing deletion mutations from pipeline output

Some deletion variants which were previously identified manually from IonTorrent output may not have been included in reportIT pipeline output; need to verify whether or not this is the case, and if so adjust pipeline to ensure their inclusion. It is possible that filtering steps were put in place early during development to intentionally remove deletions, and system requirements have since changed.

Overview report variant table and IGV snapshots are not in sync

Right now, the variant summary table displayed in the overview report is taken from the summary table TSV file, while the snapshots come from whatever files are found in the IGV_snapshots directory. This is not ideal because it is possible for issues to arise in the snapshot creation, which can lead to variants from the table not showing up in the report. Need to redesign the report code such that these aspects are built in parallel and include some checking to make sure all variants in the table have their associated snapshots.

Need stdout logging for interactive scripts

Try to use something like tee to redirect a copy of the stdout stream from wrapper scripts to per-sample log dirs, in order to try to replicate the stdout logging from qsub scripts.

'find' assertions and functions

GNU find is used frequently in the pipeline to search for files for processing; the pipeline often depends on exactly one result being returned. Need to develop a function to verify that only one item was returned, which can be bundled with the function that checks that a file/dir exists. Also develop a function to run find with given criteria but return only 'n' results, and refactor this into the pipeline.
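A minimal sketch of the "exactly one" assertion, using Python's glob in place of GNU find (the shell scripts would need an equivalent check):

import glob

def find_one(pattern):
    """Return the single path matching pattern; raise if zero or many."""
    matches = glob.glob(pattern)
    if len(matches) != 1:
        raise ValueError("expected exactly 1 match for %r, got %d"
                         % (pattern, len(matches)))
    return matches[0]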

split_df_col2cols function errors on too few columns

In the script merge_vcf_annotations.py, the function split_df_col2cols gives an error if the number of fields output by merge_df['AAChange'].str.split(':') is fewer than the number of replacement columns (5 by default).
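One possible fix is to pad (and truncate) the split results so the requested number of columns is always produced; a sketch with illustrative names, assuming no missing values in the source column:

import numpy as np
import pandas as pd

def split_col2cols(df, col, new_cols, sep=":"):
    """Split df[col] on sep into exactly len(new_cols) columns,
    padding short rows with NaN and truncating long ones."""
    n = len(new_cols)
    padded = [p[:n] + [np.nan] * (n - len(p))
              for p in df[col].str.split(sep)]
    split_df = pd.DataFrame(padded, index=df.index, columns=new_cols)
    return pd.concat([df, split_df], axis=1)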

need script to run pipeline from samplesheet

For development and end-user usage, it will help to have a script that can read a samplesheet of analysis ID's for processing, and run the pipeline for each. This also needs to work with paired samples specified in the sheet.

Develop database of all found variants

Look into using an SQL variant for this, likely SQLite. Consider which per-sample data should be included, and develop scripts to test database implementations.

Change overview report sample table caption

The current caption for the variant report sample table is:

Samples included in the analysis.

This needs to be changed to something like

Samples in the analysis which had variants

This is because the report currently only displays samples for which variants were found.

Need to download other files from server

Other sequencing run metadata files are available on the IonTorrent server. Code was added to download other files but had to be disabled as per 68cff6d because it was conflicting with annotation script steps. These files (reports, figures, etc.) would be valuable to have; need to figure out a way to include them. Might require refactoring the output of the annotation and IGV steps to save output in a separate directory instead of writing the files to the mirrored IT sample dirs.

sometimes the IGV reporting script gets hung up

Occasionally, when the IGV snapshot script is run under a qsub cluster job, the script hangs indefinitely, seemingly just before IGV gets loaded. This has not been observed when running the scripts in the current session (non-qsub), and may be caused by the cluster environment. Currently the only known solution is to kill the qsub job and re-run until it goes to completion, or forgo qsub job submission.

(Xvfb) IGV snapshot script does not work in parallel

Current implementation of the IGV snapshot Python script, which uses Xvfb to create a virtual X11 window for IGV to render into for creation of snapshots, only allows one instance of Xvfb to be running. As per email with Loren, this needs to be modified to:

You’d probably have to assign a new X server number to every instance of Xvfb (and then, accordingly, $DISPLAY). Running Xvfb processes open a socket in /tmp/.X11-unix/, which you should be able to detect with standard filesystem means (ls, test/[, …), so your job script could just look for a free number and use that for the current job.

Until this is fixed, the IGV snapshot script can only have one instance running at a time.
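A minimal sketch of the suggested free-display scan (the display range is arbitrary, and the usage note is illustrative):

import os

def free_display(start=10, end=99):
    """Return the first X display number with no socket in /tmp/.X11-unix/."""
    for n in range(start, end):
        if not os.path.exists("/tmp/.X11-unix/X%d" % n):
            return n
    raise RuntimeError("no free X display found")

# Usage: launch Xvfb on ':<n>' and set DISPLAY=':<n>' before starting IGV.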

Need pandoc bin download & setup info

The pipeline is currently using the pandoc version 1.13.1 binary on the cluster; need to verify which version is being used by the R report-compiling script, and find the best instructions for downloading the binary to include with the project docs.

WC IPM knowledge base integration

Need to work on integrating data from the Weill Cornell IPM knowledge base in order to obtain clinical interpretations for detected variants. Two methods of use need to be implemented:

  • Online: use the API to query the knowledge base and return variant information
  • Offline: use the downloaded Excel sheet to match variants with clinical interpretations

Offline implementation has priority since we already have the data downloaded. Future offline implementations will use a formal database file provided by WC instead of an Excel sheet.

Create wiki entries

Some of the information from the front page README.md needs to be moved to wiki pages for the repo

IGV batchscripts need to be concatenated

IGV can only run one batch script with the -b argument, but multiple batch scripts can be concatenated into a single script as long as the exit command is not used until the very end of the concatenated script. This will reduce the time needed to take snapshots by not loading & exiting IGV repeatedly. Need a script to parse and concatenate the IGV batchscripts in an analysis directory for this purpose. Also, to be safe, the IGV batch scripts need to be modified to use full paths to all dirs & bam files.
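The concatenation itself is simple; a minimal sketch (function name illustrative): strip 'exit' from each batchscript, then append a single 'exit' at the end of the combined script.

def concat_batchscripts(paths, out_path):
    """Concatenate IGV batchscripts, keeping only one final 'exit' command."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as f:
                for line in f:
                    if line.strip().lower() != "exit":
                        out.write(line)
        out.write("exit\n")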

IGV_snapshot_parser.sh not passing open X server value correctly

It looks like the IGV_snapshot_parser.sh script is not passing the value for the open X server correctly to the run_IGV_batchscript.py script. The proposed fix is to have the latter script find the server value itself using code from IGV-snapshot-automator. Really, this entire section of the pipeline needs to be refactored and replaced with that script, but it's going to take some work to fix.

Need to account for IonTorrent sequencing artifacts; multiple variants at locus

Sometimes a variant is called in a sample, but similar variants are also present in the control sample for the run. Consider one or more of the following:

  • for every sample, look for similar variants in control sample
  • for IonTorrent VCF variant entries that have multiple called variants at a single location, develop system of flagging these variants as possible artifacts
  • The default ~500-pixel-tall IGV snapshots might not be tall enough to show indels and other artifacts throughout the reads at a locus; consider generating extra-tall ~5000-pixel snapshots for these variants; link to report?
  • need to develop system for quickly & easily removing these from the report & summary tables upon manual review and feedback from clinicians

report for covid19 data

Hi, will this pipeline work with COVID-19 FASTA files and BAM files? If not, how do we customize it to generate a report?

Need to integrate tumor & tissue type

Find information on tumor and tissue types per sample and integrate them into the report; use criteria to match samples with the correct IPMKB clinical interpretations.

SQL integration in pipeline

The entire pipeline should be refactored to load all variants and sample data for an analysis, including IGV snapshots, into a single database file, and to query this file for reporting instead of copying and symlinking TSV tables all over the place. Make a new repo branch to start development of this method. Use SQLite; look into methods to query SQLite from R to extract the IGV snapshot blob and insert it into the R Markdown report. This will increase scalability and make the pipeline less error-prone.
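A minimal sketch of the proposed SQLite layout, with illustrative table/column names and file paths (not a settled schema): one row per variant, with the IGV snapshot stored as a blob.

import sqlite3

conn = sqlite3.connect("analysis.sqlite")
conn.execute("""CREATE TABLE IF NOT EXISTS variants (
    analysis_id TEXT, sample_id TEXT, gene TEXT,
    coord TEXT, snapshot BLOB)""")

# Store an example snapshot alongside its variant record.
with open("IGV_snapshots/example.png", "rb") as f:
    png = f.read()
conn.execute("INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
             ("16-001", "sample1", "BRAF", "chr7:140453136", sqlite3.Binary(png)))
conn.commit()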
