MutEnricher

Author: Anthony R. Soltis ([email protected], [email protected])

Institution: Uniformed Services University of the Health Sciences, Bethesda, MD

License: MIT License, see License

Version: 1.3.3

Introduction:

MutEnricher is a flexible toolset that performs somatic mutation enrichment analysis of both protein-coding and non-coding genomic loci from whole genome sequencing (WGS) data. It is implemented in Python and runs under both Python 2 and 3.

MutEnricher is now also available as a Docker image.

MutEnricher contains two distinct modules:

coding - for performing somatic enrichment analysis of non-silent variation in protein-coding genes
noncoding - for performing enrichment analysis of non-coding regions

The main driver script is mutEnricher.py, and each module is invoked through it, e.g.:

python mutEnricher.py coding ...
python mutEnricher.py noncoding ...

See help pages and associated documentation for methodological and run details.

Citation:

A MutEnricher manuscript is now published in BMC Bioinformatics. Please cite if using this software:

Soltis, A.R., Dalgard, C.L., Pollard, H.B., & Wilkerson, M.D. MutEnricher: a flexible toolset for somatic mutation enrichment analysis of tumor whole genomes. BMC Bioinformatics (2020). 20(1).

Info and User Guides:

Wiki

Quickstart guide

Tutorial

Output file descriptions

Installation:

See Installation Guide section on Wiki.

Additional utilities

In the "utilities" sub-directory, we include two helper functions for generating covariate files for use with MutEnricher's covariate clustering functions:

1. get_gene_covariates.py  
2. get_region_covariates.py

See the help pages for example usage. (1) above requires GTF input (as for the coding module) and (2) requires an input BED file (as for the noncoding module). Both also require an indexed genome FASTA file (e.g. for hg19/hg38 human genomes) as input.
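
Before launching either utility, it can help to confirm that the FASTA index is actually in place. The sketch below assumes that "indexed" means a samtools-style `.fai` file sitting next to the FASTA (an assumption, not something stated here); `has_fasta_index` is a hypothetical helper, not part of MutEnricher.

```python
import os


def fasta_index_path(fasta_path):
    """Return the samtools-style index path expected for a FASTA file."""
    return fasta_path + ".fai"


def has_fasta_index(fasta_path):
    """True if both the FASTA file and its assumed .fai index exist."""
    return os.path.isfile(fasta_path) and os.path.isfile(fasta_index_path(fasta_path))
```

If the check fails, `samtools faidx <genome.fa>` is one common way to create the index.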

Example data

We include various example files for testing MutEnricher on synthetic somatic data. See the "example_data" sub-folder.

Several quickstart commands are provided in the example_data/quickstart_commands.txt file. A sample quickstart command for coding analysis:

cd example_data
python ../mutEnricher.py coding annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt --anno-type nonsilent_terms.txt -o test_out_coding --prefix test_global

Files/folders contained in example_data:

example_data/annotation_files

Contains example GTF and BED files for running MutEnricher's coding and noncoding modules.
- ucsc.refFlat.20170829.no_chrMY.gtf.gz
- ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed
NOTE: Input GTF (coding analysis) and BED files (noncoding analysis) can be gzip compressed or not.
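
Scripts that pre-process GTF or BED files before handing them to MutEnricher can treat both cases uniformly by checking the first two bytes of the file. This is a minimal sketch, not MutEnricher's own reader; `open_maybe_gzip` and `count_records` are hypothetical helper names.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file for reading, whether or not it is gzip compressed."""
    # Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "r")


def count_records(path):
    """Count non-empty lines that do not start with '#'."""
    with open_maybe_gzip(path) as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))
```

Detecting compression from the content rather than the `.gz` extension keeps the check valid for files that were renamed.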

example_data/covariates

Contains example covariate and covariate weights files for running the covariate clustering background method:

For coding:
- ucsc.refFlat.20170829.no_chrMY.covariates.txt
- ucsc.refFlat.20170829.no_chrMY.covariate_weights.txt
For noncoding:
- ucsc.refFlat.20170829.promoters_up1kb_down200.no_chrMY.covariates.txt
- ucsc.refFlat.20170829.promoters_up1kb_down200.no_chrMY.covariate_weights.txt

nonsilent_terms.txt

Example non-silent terms file for use with coding module. This example is applicable to VCFs annotated with ANNOVAR refGene models (the sample VCFs are annotated in this way). Use with the --anno-type option in the coding module.

NOTE: These same terms will be used if "annovar" is passed to the --anno-type option.

precomputed_apcluster

This folder provides pre-computed affinity propagation results for the datasets in (1) and (2) above. These directories can be supplied to MutEnricher via the --precomputed-covars option.

For coding (all genes):
- coding.ucsc.refFlat.20170829.no_chrMY/all_genes
For noncoding:
- noncoding.ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY/apcluster_regions

quickstart_commands.txt

Sample execution commands (associated with quickstart guide).

vcf_files.txt

Sample VCF input files list file. This file contains local paths and assumes working directory is "example_data" sub-directory.
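
Because the listed paths are relative, a quick check from the intended working directory can catch missing files before a run. This sketch assumes one entry per line with the VCF path as the first tab-separated field (an assumption about the list layout; see the Wiki for the exact format). `missing_vcf_paths` is a hypothetical helper.

```python
import os


def missing_vcf_paths(list_file):
    """Return listed paths that do not exist relative to the current directory."""
    missing = []
    with open(list_file) as handle:
        for line in handle:
            # Assumed layout: path in the first tab-separated column.
            path = line.rstrip("\n").split("\t")[0].strip()
            if not path:
                continue  # skip blank lines
            if not os.path.exists(path):
                missing.append(path)
    return missing
```

Run from inside "example_data", `missing_vcf_paths("vcf_files.txt")` should return an empty list if the assumed layout holds and all sample VCFs are present.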

vcfs

Sub-directory containing 100 synthetic somatic VCF files (compressed, with accompanying .tbi index files). These files were generated by randomly inserting "somatic mutations" at positions in the hg19 genome at a target rate of ~2 mutations/Mb. Three true positive cases are included, two coding and one non-coding: non-silent mutations were inserted into the TP53 and KRAS genes, and somatic mutations were inserted into the TERT gene promoter region.
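
To put the ~2 mutations/Mb target rate in perspective, the expected background counts can be worked out directly. The sketch below is illustrative arithmetic only: the genome size (~3,100 Mb) and the 1.1 kb region length are assumptions chosen for the example, not values taken from the data set.

```python
def expected_mutations(rate_per_mb, length_bp, n_samples=1):
    """Expected mutation count in a region under a uniform background rate."""
    return rate_per_mb * (length_bp / 1e6) * n_samples


RATE_PER_MB = 2.0  # target rate quoted above

# Assumed genome size of ~3.1 Gb: several thousand mutations per sample.
per_genome = expected_mutations(RATE_PER_MB, 3.1e9)

# Assumed 1.1 kb region across 100 samples: well under one mutation in total.
per_region = expected_mutations(RATE_PER_MB, 1100, n_samples=100)

print(per_genome, per_region)
```

Under these assumptions a short region is expected to carry far less than one background mutation across the whole cohort, which is why regions with inserted mutations should stand out in the results.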

Change log

06-15-2021

Version 1.3.3
Updates:
- Included VEP annotation parsing capabilities (via "CSQ" field) in coding module.
- Included missing function in coding analysis code to parse blacklist variant input file.

05-11-2021

Version 1.3.2
Updates:
- Included SnpEff annotation parsing capabilities (via "ANN" INFO field) in coding module. Set the --anno-type option to 'SnpEff' to use pre-set annotations compatible with this tool.
- Improved error handling for interval files and regions in covariate utility scripts.

10-01-2020

Version 1.3.1
Bug fix:
- Update to coding module and gene covariate code to address incomplete merging of overlapping gene feature intervals (exons, CDS).

06-10-2020

Dockerfile added for creation of Docker image.
No code updates.

10-23-2019

Version 1.3.0
Major updates:
- 'nsamples' (binomial testing method) is now default statistical testing (--stat-type) option.
- Combined covariate clustering plus local background rate method implemented. When covariates are supplied and --use-local is also set, the programs compute local backgrounds around the features that belong to each cluster during background calculations.

10-10-2019

Version 1.2.1
Minor update to local background method, whereby minimum search window is increased to 1 Mb.

09-13-2019

Version 1.2.0
Major updates:
- Code updated for compatibility with Python 3.
- Included --stat-type option to select between original negative binomial test based on mutation counts (nmutations, default) or binomial test on number of mutated samples (nsamples).
Minor updates:
- Updated --anno-type preset options to better reflect various ANNOVAR gene annotations.
- Deprecated --repliseq-fns option in utilities code and replaced it with the -i/--interval-files option.
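
For intuition about the `nsamples` option, a one-sided binomial test on the number of mutated samples can be sketched as follows. This is a generic illustration, not MutEnricher's implementation: the per-sample probability `p` is a hypothetical input (how the tool derives its background is described in the Wiki, not here), and `math.comb` requires Python 3.8 or later.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more mutated samples."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # Sum the binomial probability mass from k up to n.
    return sum(
        math.comb(n, i) * p**i * (1.0 - p) ** (n - i)
        for i in range(k, n + 1)
    )
```

For example, `binomial_upper_tail(8, 100, 0.01)` gives the probability of seeing at least 8 mutated samples out of 100 when each sample is mutated with probability 0.01.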

03-25-2019

Version 1.1.3
Updates:
- Noncoding code now produces _region_WAP_hotspot_Fisher_enrichments.txt output file, which includes a Fisher's combined p-value computed from the overall region, WAP, and hotspot (if present) p-values.
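
As a reminder of how Fisher's method combines p-values, here is a minimal standard-library sketch. It is a textbook formulation under the assumption of independent tests, not necessarily the numerical approach MutEnricher uses; `fisher_combined_pvalue` is a hypothetical name.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # The statistic X = -2 * sum(ln p_i) is chi-square with 2k degrees
    # of freedom under the null; we work with X / 2 directly.
    half_x = -sum(math.log(p) for p in pvalues)
    # For even degrees of freedom the survival function is closed-form:
    # P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total
```

With a single p-value the function returns that value unchanged. Extremely small inputs can underflow in `math.exp`, so a production implementation would likely work in log space.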

02-12-2019

Version 1.1.2
Updates:
- In both coding and noncoding modules, new option --min-hs-samps included for setting minimum number of samples that must contain mutations in a candidate hotspot region for subsequent testing. Default is set to 2; setting to 1 is equivalent to prior default behavior.

01-15-2019

Version 1.1.1
Updates/bug fixes:
- Coding analysis code now produces output file with combined Fisher p-value for overall gene and hotspot(s) enrichments.
- Updated method used to compute Fisher p-values for better numerical accuracy.
- utilities/get_gene_covariates.py updated to read gzipped GTF files.
- Fixed minor bug in coding analysis code associated with local background rate calculation method.
- Updated coding analysis code to calculate gene background mutation rate from samples possessing at least one non-silent mutation.

06-15-2018

Initial release.

The development of this Software was sponsored by the Uniformed Services University of the Health Sciences (USU); however, the information or content and conclusions do not necessarily represent the official position or policy of, nor should any official endorsement be inferred on the part of, USU, the Department of Defense, or the U.S. Government.

Setting up covariates

Hi there,
Congrats on a wonderful package. And thank you for the most comprehensive tutorial! Much appreciated.
I am trying to derive my own set of covariates for hg38.
I have downloaded Repli-seq data for neural progenitor cells from https://www2.replicationdomain.com/database.php#
and used this command to generate covariates:

python ../../utilities/get_region_covariates.py Annotated_promoter_HOMER.bed hg38.fa -i interval_files.txt 
Loading regions...
Loaded 19227 regions from input BED file.
  Divided 19227 regions into 20 region chunks.

However, nothing seemed to happen for a very long time (>1 hour), so I had to abort.

cat interval_files.txt 
RT_BG01_NPC_hg38.bedgraph	NPC_RT	

head Annotated_promoter_HOMER.bed
chr1	68091	69191	OR4F5	NM_001005484	ENSG00000186092	ENST00000641515	promoter-TSS	1100
chr1	451578	452678	OR4F3	NM_001005224	ENSG00000230178	ENST00000456475	promoter-TSS	1100
chr1	451578	452678	OR4F29	NM_001005221	ENSG00000284733	ENST00000426406	promoter-TSS	1100
chr1	451578	452678	OR4F16	NM_001005277	ENSG00000284662	ENST00000332831	promoter-TSS	1100
chr1	924731	925831	SAMD11	NM_152486	ENSG00000187634	ENST00000342066	promoter-TSS	1100
chr1	959156	960256	NOC2L	NM_015658	ENSG00000188976	ENST00000327044	promoter-TSS	1100
chr1	959584	960684	KLHL17	NM_198317	ENSG00000187961	ENST00000338591	promoter-TSS	1100
chr1	965482	966582	PLEKHN1	NM_001160184	ENSG00000187583	ENST00000379407	promoter-TSS	1100
chr1	981073	982173	PERM1	NM_001369898	ENSG00000187642	ENST00000433179	promoter-TSS	1100
chr1	999997	1001097	HES4	NM_001142467	ENSG00000188290	ENST00000428771	promoter-TSS	1100

head RT_BG01_NPC_hg38.bedgraph 
chr1	100000747	100000806	0.713487
chr1	100002056	100002115	0.714855
chr1	100003760	100003819	0.716606
chr1	100004249	100004308	0.717102
chr1	100005177	100005236	0.718038
chr1	100006848	100006907	0.719705
chr1	100007434	100007493	0.720284
chr1	100008600	100008659	0.721431
chr1	100009671	100009730	0.722477
chr1	10001056	10001115	1.16089

Could you help me troubleshoot this please?
Thank you.
A
