morriyaty / panpop

This project forked from starskyzheng/panpop

License: MIT License

PanPop

Application of graph-based pan-genomes to populations based on resequencing data.

PanPop aims to combine a graph-based pan-genome with population resequencing data. Using the graph genome yields more accurate population SVs, and even SNPs, in standard VCF format, which is convenient for downstream analysis.
PanPop has two modes. genotype mode genotypes the SVs already present in the graph genome, so no novel SVs or SNPs are reported. augment mode additionally extracts novel SNPs and SVs for each sample.
PanPop is compatible with single servers, ssh-based clusters, slurm-based clusters, grid and cloud environments. Most of PanPop runs in parallel, and it is easy to install.

Usage

A graph-based genome in GFA format and resequencing reads are required. The reference GFA should be renamed to Ref.gfa. Create a list file of the paired-end sequencing reads (like example/1.sample.reads.list). Set workdir and sample_reads_list_file in config.yaml. You are ready to go!

Example:

git clone https://github.com/StarSkyZheng/panpop.git
cd panpop
cp config.x64.yaml config.yaml
snakemake -j 3 --reason --printshellcmds

Results are located in example/5.final_result, or in example/9.aug_final_result for augment mode.

Installation:

Dependencies: python3 & perl>=5.24

pip3 install --user snakemake
curl -L https://cpanmin.us | perl - App::cpanminus
cpanm Data::Dumper MCE::Flow MCE::Candy MCE::Channel MCE::Shared Getopt::Long List::Util Carp File::Spec YAML Tie::CharArray IPC::Open2 File::Temp  
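
The dependency line above asks for perl>=5.24. As a minimal sketch of how that requirement could be checked before running the pipeline, the snippet below parses the value Perl prints for its `$]` variable (for example 5.030000 for Perl 5.30.0). The helper names are hypothetical and are not part of PanPop.

```python
import shutil
import subprocess

MIN_PERL = (5, 24)  # taken from the dependency line above


def parse_perl_version(raw):
    """Turn Perl's $] string (e.g. '5.030000') into a (major, minor) tuple."""
    major, _, rest = raw.strip().partition(".")
    minor = rest[:3] or "0"  # $] encodes the minor version in three digits
    return int(major), int(minor)


def perl_is_recent_enough(raw, minimum=MIN_PERL):
    """True if the $] string describes a Perl at least as new as `minimum`."""
    return parse_perl_version(raw) >= minimum


def installed_perl_version():
    """Return the $] string of the perl found on PATH, or None if unavailable."""
    perl = shutil.which("perl")
    if perl is None:
        return None
    result = subprocess.run([perl, "-e", "print $]"],
                            capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    raw = installed_perl_version()
    if raw is None:
        print("perl was not found on PATH")
    elif perl_is_recent_enough(raw):
        print("perl version looks sufficient:", raw)
    else:
        print("perl is older than 5.24:", raw)
```

The parsing is kept separate from the subprocess call so the version comparison can be exercised without Perl being installed.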

Parameters:

All parameters are defined in the config.yaml file.

Basic Parameters

workdir: Directory that contains the sample reads list file and will store the result files.

sample_reads_list_file: File name of the sample reads list. Must be located in workdir.

split_chr: (False or True). Whether to split by chromosome for parallel running. This is useful for multi-node clusters and can greatly reduce memory usage in augment mode. If you have only one compute node, the recommended setting is False. Default is False.

mode: (genotype or augment). genotype mode genotypes SVs in the graph genome. augment mode also extracts novel SNPs and SVs for each sample. Default is genotype.
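
As a rough illustration of the constraints on the basic parameters, the sketch below checks an already-loaded configuration dictionary. It is not PanPop's own validation code: the function name is hypothetical, and loading config.yaml into a dict (for example with a YAML parser) is assumed to happen elsewhere.

```python
ALLOWED_MODES = {"genotype", "augment"}


def check_basic_config(cfg):
    """Return a list of problems found in the basic parameters (empty if none)."""
    problems = []

    # workdir and sample_reads_list_file have no documented default.
    for key in ("workdir", "sample_reads_list_file"):
        if not cfg.get(key):
            problems.append(f"{key} is missing or empty")

    # mode defaults to genotype and accepts only two values.
    mode = cfg.get("mode", "genotype")
    if mode not in ALLOWED_MODES:
        problems.append(f"mode must be genotype or augment, got {mode!r}")

    # split_chr defaults to False and must be a boolean.
    split_chr = cfg.get("split_chr", False)
    if not isinstance(split_chr, bool):
        problems.append("split_chr must be True or False")

    return problems


if __name__ == "__main__":
    example = {
        "workdir": "example",
        "sample_reads_list_file": "1.sample.reads.list",
        "split_chr": False,
        "mode": "augment",
    }
    print(check_basic_config(example) or "basic parameters look fine")
```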

Filter Parameters:

MAP_MINQ: Minimum mapping quality of each read for giraffe mapping in vg. Default is 5.

MAF: Remove alleles with a frequency lower than this value. This is necessary and will greatly reduce the complexity of SVs. Default is 0.01.

max_missing_rate: Maximum missing rate allowed for each SNP and SV. 0 means no missing genotypes are allowed and 1 means no filtering by missing rate. Default is 0.3 (a site is removed if more than 30% of samples are missing).

dp_min_fold & dp_max_fold: Hard filter based on depth. The minimum and maximum depth for each site are these multiples of the mean depth of the sample. Default is 1/3 and 3.

mad_min_fold: Hard filter based on depth. The minimum MAD (minimum site allele depth) for each site is this multiple of the mean depth of the sample. Default is 1/3 * 1/3.
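
To make the fold-based thresholds concrete, here is a small worked sketch of how the depth and missing-rate filters described above could be interpreted. The function names are hypothetical, and treating the bounds as inclusive is an assumption; PanPop's actual implementation may differ.

```python
def site_passes_depth(site_depth, mean_depth, dp_min_fold=1 / 3, dp_max_fold=3.0):
    """Keep a site whose depth lies within [min_fold, max_fold] x mean depth."""
    low = mean_depth * dp_min_fold
    high = mean_depth * dp_max_fold
    return low <= site_depth <= high  # inclusive bounds are an assumption


def allele_passes_mad(allele_depth, mean_depth, mad_min_fold=(1 / 3) * (1 / 3)):
    """Keep an allele whose depth reaches mad_min_fold x mean depth."""
    return allele_depth >= mean_depth * mad_min_fold


def site_passes_missing(n_missing, n_samples, max_missing_rate=0.3):
    """Keep a site unless more than max_missing_rate of samples are missing."""
    return n_missing / n_samples <= max_missing_rate


if __name__ == "__main__":
    mean_depth = 30  # illustrative mean depth of one sample
    # With the defaults, site depth must fall roughly between 10 and 90.
    for depth in (5, 12, 100):
        print(depth, site_passes_depth(depth, mean_depth))
    # With the default mad_min_fold of 1/9, an allele needs about 3.3 reads here.
    print(allele_passes_mad(4, mean_depth), allele_passes_mad(2, mean_depth))
    # 4 missing samples out of 10 exceeds the default 30% limit.
    print(site_passes_missing(4, 10))
```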

Realign Parameters:

realign_extend_bp_max & realign_extend_bp_min: During realignment, nearby SNPs/SVs are merged together before being realigned. Values that are too low or too high may reduce accuracy. Default is 10 and 1.
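
The idea of merging nearby variants before realignment can be sketched as a simple distance-based grouping. This is only an illustration under stated assumptions: it uses a single hypothetical `extend_bp` distance and does not model how PanPop combines realign_extend_bp_max and realign_extend_bp_min.

```python
def group_nearby(positions, extend_bp=10):
    """Group variant positions whose gap to the previous one is <= extend_bp."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] <= extend_bp:
            groups[-1].append(pos)  # close enough: extend the current group
        else:
            groups.append([pos])    # too far away: start a new group
    return groups


if __name__ == "__main__":
    positions = [100, 105, 130, 138, 400]  # made-up coordinates
    print(group_nearby(positions, extend_bp=10))
    print(group_nearby(positions, extend_bp=1))
```

A larger distance produces fewer, bigger groups to realign, while a smaller one leaves most variants on their own, which is one way to picture the trade-off the parameter description mentions.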

More parameters

SV_min_length: Minimum length for SVs. Variants longer than one base and shorter than this value are treated as InDels. Default is 50.

realign_max_try_times_per_method: Maximum number of attempts for each alignment software. Default is 3.

memory_tmp_dir: Temporary directory in memory. Must be on a very fast disk. Free space can be smaller than 100 MB. Default is /run/user/USERID

mapper: Mapping algorithm. Can be either 'map' or 'gaffe'.

aug_nonmut_min_cov: In augment mode, if the coverage of the reference allele of an SV is greater than this value, the reference allele is treated as present. Note that this is only the first filter; further depth-based filtering is performed afterwards. Default is 0.8.

aug_nomut_min_dp: In augment mode, if the average depth of the reference allele of an SV is greater than this value, the reference allele is treated as present. Note that this is only the first filter; further depth-based filtering is performed afterwards. Default is 3.
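
The SV_min_length rule described above can be pictured with a short classification sketch. The function name is hypothetical, and measuring a variant by the longer of its REF and ALT alleles is an assumption rather than PanPop's documented behaviour.

```python
def classify_variant(ref, alt, sv_min_length=50):
    """Label a variant as SNP, InDel or SV from its allele lengths."""
    length = max(len(ref), len(alt))  # assumed definition of variant length
    if length == 1:
        return "SNP"
    if length < sv_min_length:
        return "InDel"  # longer than one base but shorter than SV_min_length
    return "SV"


if __name__ == "__main__":
    print(classify_variant("A", "G"))       # single-base substitution
    print(classify_variant("A", "ATTG"))    # short insertion
    print(classify_variant("A", "A" * 60))  # long insertion
```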

Notice

The graph genome should be in GFA format, and the chromosome names must be specified.
A chromosome name cannot be all.
The backbone genome name should be numeric only. Non-backbone genome names should NOT be numeric only.
scripts/build_graph.pl can be used to build a suitable GFA file from genome FASTA files.
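
The naming rules in this notice can be checked with a few lines of code before building the graph. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it takes the names as plain strings instead of reading them from a GFA file.

```python
def naming_problems(backbone, non_backbone, chromosomes):
    """Return a list of violations of the naming rules (empty if none)."""
    problems = []

    def numeric_only(name):
        # ASCII digits only, so characters such as superscripts do not count.
        return name.isascii() and name.isdigit()

    if not numeric_only(backbone):
        problems.append(f"backbone name {backbone!r} should be numeric only")
    for name in non_backbone:
        if numeric_only(name):
            problems.append(f"non-backbone name {name!r} should not be numeric only")
    for chrom in chromosomes:
        if chrom == "all":
            problems.append("a chromosome cannot be named 'all'")
    return problems


if __name__ == "__main__":
    # Made-up names for illustration only.
    print(naming_problems("1", ["sampleA", "2b"], ["chr1", "chr2"]))
    print(naming_problems("ref", ["7"], ["all"]))
```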

Citations

This software uses HAlign, bcftools, vg, muscle and snakemake:

  • [1] Shixiang Wan and Quan Zou, HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing, Algorithms for Molecular Biology, 2017, 12:25.
  • [2] Danecek P, Bonfield JK, et al. Twelve years of SAMtools and BCFtools. Gigascience (2021) 10(2):giab008.
  • [3] Hickey, G., Heller, D., Monlong, J. et al. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol 21, 35 (2020).
  • [4] Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792-1797.
  • [5] Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J., 2021. Sustainable data analysis with Snakemake. F1000Res 10, 33.
