MutEnricher


Author: Anthony R. Soltis

Institution: Uniformed Services University of the Health Sciences, Bethesda, MD

License: MIT License, see License

Version: 1.3.3

Introduction:

MutEnricher is a flexible toolset that performs somatic mutation enrichment analysis of both protein-coding and non-coding genomic loci from whole genome sequencing (WGS) data, implemented in Python and usable with Python 2 and 3.

MutEnricher is now also available as a Docker image.

MutEnricher contains two distinct modules:

  1. coding - for performing somatic enrichment analysis of non-silent variation in protein-coding genes
  2. noncoding - for performing enrichment analysis of non-coding regions

The main driver script is mutEnricher.py, and each module can be invoked from it, e.g.:

  1. python mutEnricher.py coding ...
  2. python mutEnricher.py noncoding ...

See help pages and associated documentation for methodological and run details.

Citation:

A MutEnricher manuscript is now published in BMC Bioinformatics. Please cite if using this software:

Soltis, A.R., Dalgard, C.L., Pollard, H.B., & Wilkerson, M.D. MutEnricher: a flexible toolset for somatic mutation enrichment analysis of tumor whole genomes. BMC Bioinformatics (2020). 20(1).

Info and User Guides:

Wiki

Quickstart guide

Tutorial

Output file descriptions

Installation:

See Installation Guide section on Wiki.

Additional utilities

In the "utilities" sub-directory, we include two helper functions for generating covariate files for use with MutEnricher's covariate clustering functions:

1. get_gene_covariates.py  
2. get_region_covariates.py

See the help pages for example usage. (1) above requires GTF input (as for the coding module) and (2) requires an input BED file (as for the noncoding module). Both also require an indexed genome FASTA file (e.g. for the hg19/hg38 human genomes) as input.

Example data

We include various example files for testing MutEnricher on synthetic somatic data. See the "example_data" sub-folder.

Several quickstart commands are provided in the example_data/quickstart_commands.txt file. A sample quickstart command for coding analysis:

cd example_data
python ../mutEnricher.py coding annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt --anno-type nonsilent_terms.txt -o test_out_coding --prefix test_global
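The vcf_files.txt argument in the command above is a list of input VCFs. As a minimal sketch for building such a list for your own data, the helper below assumes a two-column, tab-delimited layout (VCF path, then sample name); the helper name, the suffix-stripping rule for sample names, and the tab delimiter are assumptions for illustration, so check the quickstart guide for the exact format.

```python
import os


def vcf_list_lines(vcf_paths):
    """Return one 'path<TAB>sample_name' line per VCF path (assumed layout)."""
    lines = []
    for path in sorted(vcf_paths):
        name = os.path.basename(path)
        # Derive a sample name by stripping a VCF suffix (hypothetical rule).
        for suffix in (".vcf.gz", ".vcf"):
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                break
        lines.append("{}\t{}".format(path, name))
    return lines


if __name__ == "__main__":
    # Hypothetical paths; replace with your own VCF files.
    for line in vcf_list_lines(["vcfs/sample_2.vcf.gz", "vcfs/sample_1.vcf.gz"]):
        print(line)
```

Redirecting the printed lines to a text file gives a list that can be passed in place of vcf_files.txt, provided the paths resolve from the working directory.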

Files/folders contained in example_data:

  1. example_data/annotation_files

    Contains example GTF and BED files for running MutEnricher's coding and noncoding modules.

    • ucsc.refFlat.20170829.no_chrMY.gtf.gz
    • ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed

    NOTE: Input GTF (coding analysis) and BED files (noncoding analysis) can be gzip compressed or not.

  2. example_data/covariates

    Contains example covariate and covariate weights files for running the covariate clustering background method:

    For coding:

    • ucsc.refFlat.20170829.no_chrMY.covariates.txt
    • ucsc.refFlat.20170829.no_chrMY.covariate_weights.txt

    For noncoding:

    • ucsc.refFlat.20170829.promoters_up1kb_down200.no_chrMY.covariates.txt
    • ucsc.refFlat.20170829.promoters_up1kb_down200.no_chrMY.covariate_weights.txt

  3. nonsilent_terms.txt

    Example non-silent terms file for use with the coding module. This example is applicable to VCFs annotated with ANNOVAR refGene models (the sample VCFs are annotated in this way). Use with the --anno-type option in the coding module.

    NOTE: These same terms will be used if "annovar" is passed to the --anno-type option.

  4. precomputed_apcluster

    This folder provides pre-computed affinity propagation results for the datasets in (1) and (2) above. These directories can be supplied to MutEnricher via the --precomputed-covars option.

    For coding (all genes):

    • coding.ucsc.refFlat.20170829.no_chrMY/all_genes

    For noncoding:

    • noncoding.ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY/apcluster_regions

  5. quickstart_commands.txt

    Sample execution commands (associated with quickstart guide).

  6. vcf_files.txt

    Sample list of VCF input files. This file contains local paths and assumes the working directory is the "example_data" sub-directory.

  7. vcfs

    Sub-directory containing 100 synthetic somatic VCF files (compressed with index .tbi files). These files were generated by randomly inserting "somatic mutations" at positions in the hg19 genome at a target rate of ~2 mutations/Mb. Three true positive cases are included, two coding and one non-coding, whereby non-silent mutations were inserted into the TP53 and KRAS genes and somatic mutations were inserted into the TERT gene promoter region.
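To make the ~2 mutations/Mb target concrete, here is a rough sketch of the arithmetic and of uniform random placement. It is not the script used to build the example VCFs: the function names, the toy contig lengths, and the uniform sampling scheme are assumptions for illustration only.

```python
import random


def expected_mutations(genome_length_bp, rate_per_mb=2.0):
    """Expected mutation count for a genome of the given length."""
    return genome_length_bp / 1e6 * rate_per_mb


def sample_positions(contig_lengths, rate_per_mb=2.0, seed=0):
    """Draw (contig, 1-based position) pairs uniformly across all contigs."""
    rng = random.Random(seed)
    total_bp = sum(contig_lengths.values())
    n_mut = int(round(expected_mutations(total_bp, rate_per_mb)))
    contigs = sorted(contig_lengths)
    weights = [contig_lengths[c] for c in contigs]
    positions = []
    for _ in range(n_mut):
        # Pick a contig with probability proportional to its length.
        contig = rng.choices(contigs, weights=weights)[0]
        positions.append((contig, rng.randint(1, contig_lengths[contig])))
    return sorted(positions)


if __name__ == "__main__":
    # A 3,000 Mb genome at 2 mutations/Mb gives about 6,000 mutations per sample.
    print(expected_mutations(3000000000))
    # Toy contigs (hypothetical lengths): 8 Mb total, so 16 sampled positions.
    print(len(sample_positions({"chrA": 5000000, "chrB": 3000000})))
```

At this rate a single 1 kb region is hit only rarely by chance in any one sample, which is why recurrent mutations in regions such as the inserted true positives stand out.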

Change log


06-15-2021

  • Version 1.3.3
  • Updates:
    • Included VEP annotation parsing capabilities (via "CSQ" field) in coding module.
    • Included missing function in coding analysis code to parse blacklist variant input file.

05-11-2021

  • Version 1.3.2
  • Updates:
    • Included SnpEff annotation parsing capabilities (via "ANN" INFO field) in coding module. Set the --anno-type option to 'SnpEff' to use pre-set annotations compatible with this tool.
    • Improved error handling for interval files and regions in covariate utility scripts.

10-01-2020

  • Version 1.3.1
  • Bug fix:
    • Update to coding module and gene covariate code to address incomplete merging of overlapping gene feature intervals (exons, CDS).

06-10-2020

  • Dockerfile added for creation of Docker image.
  • No code updates.

10-23-2019

  • Version 1.3.0
  • Major updates:
    • 'nsamples' (binomial testing method) is now default statistical testing (--stat-type) option.
    • Combined covariate clustering plus local background rate method implemented. When covariates are supplied and --use-local is also set, the programs compute local backgrounds around the features that are part of each cluster during background calculations.
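As an illustration of what a binomial test on the number of mutated samples computes, the sketch below evaluates an exact upper-tail binomial probability. This is a generic textbook calculation, not MutEnricher's code: the function name is hypothetical, and the per-sample probability p is taken as an input because how the tool derives it from the background rate is not described here.

```python
from math import comb  # Python 3.8+


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical inputs: 8 of 100 samples mutated in a region, with an
    # assumed 1% chance that any one sample is mutated there by background alone.
    print(binomial_upper_tail(8, 100, 0.01))
```

A small tail probability means that seeing at least k mutated samples would be unlikely under the assumed background probability.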

10-10-2019

  • Version 1.2.1
  • Minor update to local background method, whereby minimum search window is increased to 1 Mb.

09-13-2019

  • Version 1.2.0
  • Major updates:
    • Code updated for compatibility with Python 3.
    • Included --stat-type option to select between original negative binomial test based on mutation counts (nmutations, default) or binomial test on number of mutated samples (nsamples).
  • Minor updates:
    • Updated --anno-type preset options to better reflect various ANNOVAR gene annotations.
    • Deprecated the --repliseq-fns option in the utilities code in favor of the -i/--interval-files option.

03-25-2019

  • Version 1.1.3
  • Updates:
    • Noncoding code now produces a _region_WAP_hotspot_Fisher_enrichments.txt output file, which includes a Fisher's combined p-value computed from the overall region, WAP, and hotspot (if present) p-values.
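For readers unfamiliar with Fisher's method, the sketch below shows the standard calculation: the statistic -2 * sum(ln p_i) is compared against a chi-square distribution with 2k degrees of freedom, where k is the number of p-values. It uses the closed-form chi-square tail for even degrees of freedom so that only the standard library is needed. The function name is hypothetical and this is not necessarily how MutEnricher computes the value.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    # Half of the Fisher statistic: -sum(ln p_i).
    half_stat = -sum(math.log(p) for p in pvalues)
    # Chi-square survival function for df = 2k:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


if __name__ == "__main__":
    # Hypothetical region, WAP, and hotspot p-values.
    print(fisher_combined_pvalue([0.01, 0.04, 0.20]))
```

Fisher's method assumes the combined p-values are independent, so treat the result as approximate when the tests share information.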

02-12-2019

  • Version 1.1.2
  • Updates:
    • In both coding and noncoding modules, new option --min-hs-samps included for setting minimum number of samples that must contain mutations in a candidate hotspot region for subsequent testing. Default is set to 2; setting to 1 is equivalent to prior default behavior.

01-15-2019

  • Version 1.1.1
  • Updates/bug fixes:
    • Coding analysis code now produces output file with combined Fisher p-value for overall gene and hotspot(s) enrichments.
    • Updated method used to compute Fisher p-values for better numerical accuracy.
    • utilities/get_gene_covariates.py updated to read gzipped GTF files.
    • Fixed minor bug in coding analysis code associated with local background rate calculation method.
    • Updated coding analysis code to calculate gene background mutation rate from samples possessing at least one non-silent mutation.

06-15-2018

  • Initial release.

The development of this Software was sponsored by the Uniformed Services University of the Health Sciences (USU); however, the information or content and conclusions do not necessarily represent the official position or policy of, nor should any official endorsement be inferred on the part of, USU, the Department of Defense, or the U.S. Government.

mutenricher's Issues

Setting up covariates

Hi there,
Congrats on a wonderful package. And thank you for the most comprehensive tutorial! Much appreciated.
I am trying to derive my own set of covariates for hg38.
I have downloaded repliseq data for neural progenitor cells from https://www2.replicationdomain.com/database.php#
And used this command to generate covariates

python ../../utilities/get_region_covariates.py Annotated_promoter_HOMER.bed hg38.fa -i interval_files.txt 
Loading regions...
Loaded 19227 regions from input BED file.
  Divided 19227 regions into 20 region chunks.

However, nothing seemed to happen for a very long time (>1 hour), so I had to abort.

cat interval_files.txt 
RT_BG01_NPC_hg38.bedgraph	NPC_RT	

head Annotated_promoter_HOMER.bed
chr1	68091	69191	OR4F5	NM_001005484	ENSG00000186092	ENST00000641515	promoter-TSS	1100
chr1	451578	452678	OR4F3	NM_001005224	ENSG00000230178	ENST00000456475	promoter-TSS	1100
chr1	451578	452678	OR4F29	NM_001005221	ENSG00000284733	ENST00000426406	promoter-TSS	1100
chr1	451578	452678	OR4F16	NM_001005277	ENSG00000284662	ENST00000332831	promoter-TSS	1100
chr1	924731	925831	SAMD11	NM_152486	ENSG00000187634	ENST00000342066	promoter-TSS	1100
chr1	959156	960256	NOC2L	NM_015658	ENSG00000188976	ENST00000327044	promoter-TSS	1100
chr1	959584	960684	KLHL17	NM_198317	ENSG00000187961	ENST00000338591	promoter-TSS	1100
chr1	965482	966582	PLEKHN1	NM_001160184	ENSG00000187583	ENST00000379407	promoter-TSS	1100
chr1	981073	982173	PERM1	NM_001369898	ENSG00000187642	ENST00000433179	promoter-TSS	1100
chr1	999997	1001097	HES4	NM_001142467	ENSG00000188290	ENST00000428771	promoter-TSS	1100

head RT_BG01_NPC_hg38.bedgraph 
chr1	100000747	100000806	0.713487
chr1	100002056	100002115	0.714855
chr1	100003760	100003819	0.716606
chr1	100004249	100004308	0.717102
chr1	100005177	100005236	0.718038
chr1	100006848	100006907	0.719705
chr1	100007434	100007493	0.720284
chr1	100008600	100008659	0.721431
chr1	100009671	100009730	0.722477
chr1	10001056	10001115	1.16089

Could you help me troubleshoot this please?
Thank you.
A

Installation, where is the math_funcs directory?

Hello,

I'd first like to say this tool is very well documented and it is perfect for what I need it to do. The only problem is that I'm having trouble installing it. I am on the "Cythonize math functions code" step and cannot find the math_funcs directory. I assume it is a subdirectory of the Cython library package, but I cannot find it.

Could you help me out?

Thanks

non-coding with covariates message: re-running unfinished contigs... indefinitely?

Hi - awesome program. I ran MutEnricher with docker without covariates successfully. Now I've added covariates, and it's been running for over a week without completing on 10 processors with 120GB of RAM. I'm scanning 18000 ~300 base regions in 75 samples. Can you suggest how I can tell whether it's in a loop or making progress? Thanks!

Messages from the last week:

re-running unfinished contigs: ['chr1', 'chr2']
  chr2 done.
  chr1 done.
  re-running unfinished contigs: ['chr1', 'chr2']

ZeroDivisionError: division by zero

Hi there,

I tried to create a covariate file using a processed bed file from GencodeV38, using the command below and I got this error. Could you please help me troubleshoot?

python ../utilities/get_region_covariates.py ../../Elements/Annotations/fiveprimeGENCODEv38.bed ../../Elements/Genome/hg38.fa --interval-files covariates/interval_files.txt -p 12 -o fiveprime_covariates.txt
Loading regions...
Loaded 91302 regions from input BED file.
Traceback (most recent call last):
  File "../utilities/get_region_covariates.py", line 322, in <module>
    if __name__ == '__main__': main()
  File "../utilities/get_region_covariates.py", line 104, in main
    r.get_seq_gc_cont()
  File "../utilities/get_region_covariates.py", line 289, in get_seq_gc_cont
    GCcont = (numC+numG) / tot
ZeroDivisionError: division by zero

head ../../Elements/Annotations/fiveprimeGENCODEv38.bed
chr1	65419	65433	OR4F5
chr1	450740	450742	OR4F29
chr1	685679	685718	OR4F16
chr1	686655	686673	OR4F16
chr1	923923	924431	SAMD11
chr1	923923	924431	SAMD11
chr1	925150	925189	SAMD11
chr1	925731	925800	SAMD11
chr1	959241	959256	NOC2L
chr1	960584	960693	KLHL17

I'm not sure what's causing this error. I'm happy to provide you with the bed file if you think it'd be useful.

A
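The traceback above ends at the division (numC+numG) / tot, so tot was zero for at least one region. One possible cause, which is an assumption and not something confirmed in this thread, is a region whose extracted sequence is empty or contains no A/C/G/T bases (for example, all N). A guarded GC calculation could look like the sketch below; the function name and the choice to return None are hypothetical and do not reflect the script's actual code.

```python
def gc_content(sequence):
    """Return the GC fraction over A/C/G/T bases, or None if there are none."""
    seq = sequence.upper()
    num_gc = seq.count("G") + seq.count("C")
    total = num_gc + seq.count("A") + seq.count("T")
    if total == 0:
        # Empty or all-N sequence: no defined GC content.
        return None
    return num_gc / float(total)


if __name__ == "__main__":
    # Hypothetical sequences for three regions.
    for seq in ("ACGTGGCC", "NNNNNN", ""):
        print(repr(seq), gc_content(seq))
```

Running a check like this over the sequences of the input regions would show which BED lines, if any, yield no usable bases.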

Job won't run to completion

Hello,

I really like your tool, but I was wondering why my jobs won't run to completion. They always stop at "Performing weighted average proximity (WAP) hotspot enrichments...". Could you help me figure out why this is the case?

The commands I run are:

python MutEnricher-master/mutEnricher.py noncoding /data2/samir/SNP_Hotspot/bed/Introns_merged_mm10.bed VCF_Cd1_Nmasked.txt -o Cd1_Nmasked_Intron_noncoding --prefix test_global -p 5

python MutEnricher-master/mutEnricher.py noncoding /data2/samir/SNP_Hotspot/bed/Introns_merged_mm10.bed VCF_Cd1_snps_filtered.txt -o Cd1_snps_filtered_final_Intron_noncoding --prefix test_global -p 5

My bed file is here (I changed it to .txt form so I could upload):
the bedfile was sorted and merged
Introns_merged_mm10.bed.txt
head:
chr1 3216968 3421701 1_Intron
chr1 3421901 3670551 2_Intron
chr1 4120073 4142611 3_Intron
chr1 4142766 4147811 4_Intron
chr1 4147963 4148611 5_Intron
chr1 4148744 4163854 6_Intron
chr1 4163941 4170204 7_Intron
chr1 4170404 4197533 8_Intron
chr1 4197641 4206659 9_Intron
chr1 4206837 4226610 10_Intron

My vcf.txt files are here:
/data2/samir/SNP_Hotspot/vcf/Cd1_Nmasked_final_sorted.mm10.vcf.gz Cd1_Nmasked
/data2/samir/SNP_Hotspot/vcf/Cd1_snps_filtered_final_sorted.mm10.vcf.gz Cd1_snps_filtered

I bgzipped, sorted, and indexed them with bgzip and bcftools.

run.txt

run2.txt
