JetBrains Research

chipseq-smk-pipeline

Snakemake-based pipeline for processing ChIP-seq and ATAC-seq datasets, from raw data QC and alignment to visualization and peak calling.

Scheme

During the peak calling steps, chipseq-smk-pipeline automatically matches each signal file with a control file by file name proximity.
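
The README does not specify how name proximity is measured. The following is a minimal sketch of one possible approach, not the pipeline's actual implementation. It assumes that control samples contain "input" or "control" in their names and that similarity is scored with Python's difflib; the function names and sample names are hypothetical.

```python
from difflib import SequenceMatcher


def is_control(name):
    """Assumption: control samples carry 'input' or 'control' in their name."""
    lowered = name.lower()
    return "input" in lowered or "control" in lowered


def closest_control(signal, names):
    """Return the control name most similar to `signal`, or None if there is none."""
    controls = [n for n in names if is_control(n)]
    if not controls:
        return None
    return max(controls, key=lambda c: SequenceMatcher(None, signal, c).ratio())


# Hypothetical sample names, used only for illustration.
samples = ["donor1_k4me3", "donor2_k4me3", "donor1_input", "donor2_input"]
for sample in samples:
    if not is_control(sample):
        print(sample, "->", closest_control(sample, samples))
```

With a scheme like this, naming signal and control files with a shared prefix (donor, replicate, batch) makes the intended pairing the most similar one.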

Input

Input FASTQ files

The pipeline aligns FASTQ or gzipped FASTQ reads, as configured in config.yaml.
The reads folder is a path relative to the pipeline working directory and is defined by the fastq_dir property.
The FASTQ reads extension is defined by the fastq_ext property, e.g. fq, fq.gz, fastq, or fastq.gz.
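
To illustrate how fastq_dir and fastq_ext together select the input files, here is a small sketch that lists the matching reads and warns when none are found. It is an illustration under assumptions, not code from the pipeline; find_reads is a hypothetical helper.

```python
from pathlib import Path


def find_reads(work_dir, fastq_dir, fastq_ext):
    """List read files under <work_dir>/<fastq_dir> whose names end with .<fastq_ext>."""
    # If fastq_dir is absolute, pathlib uses it as-is and ignores work_dir.
    reads_dir = Path(work_dir) / fastq_dir
    reads = sorted(p for p in reads_dir.glob(f"*.{fastq_ext}") if p.is_file())
    if not reads:
        print(f"Warning: no *.{fastq_ext} files found in {reads_dir}")
    return reads


if __name__ == "__main__":
    # Hypothetical layout: reads live in ./fastq and end with .fastq.gz
    for path in find_reads(".", "fastq", "fastq.gz"):
        print(path)
```

Running a check like this before launching the pipeline helps catch a mismatched extension, for example fastq_ext=fastq when the files are actually named *.fastq.gz.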

Input BAM files

Use the start_with_bams=True config option to start from existing BAM files.
In this mode the pipeline starts with the BAM files in the work_dir/bams folder.

Files

Path          Description
config.yaml   Default pipeline options
trimmed       Trimmed FASTQ file, if trim_reads option is True.
bams          BAMs with aligned reads, MAPQ >= 30
bw            BAM coverage visualization using DeepTools
macs2         MACS2 peaks
sicer         SICER peaks
span          SPAN peaks
qc            QC Reports
multiqc       MultiQC reports for different steps
logs          Shell commands logs

Requirements

The pipeline requires conda.

  • If conda is not installed, follow the instructions on the Conda website.
  • Navigate to the repository directory.

Create a Conda environment for snakemake:

$ conda env create --file environment.yaml --name snakemake

Activate the newly created environment:

$ source activate snakemake

On Ubuntu please ensure that gawk is installed:

$ sudo apt-get install gawk

Launch

Run the pipeline starting from FASTQ reads:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all [--cores <cores>] --use-conda --directory <work_dir> \
    --config fastq_dir=<fastq_dir> genome=<genome> --rerun-incomplete

By default, the pipeline does not launch peak callers. Add macs2=True, sicer=True, or span=True to the --config options to call peaks with MACS2, SICER, or SPAN, respectively.

To launch MACS2 in --broad mode, use the following config:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all [--cores <cores>] --use-conda --directory <work_dir> \
    --config fastq_dir=<fastq_dir> genome=<genome> \
    macs2=True macs2_mode=broad macs2_params="--broad --broad-cutoff 0.1" macs2_suffix=broad0.1 \
    --rerun-incomplete

See config.yaml for the complete list of parameters. Use --config to override the default options from the config.yaml file.
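
The precedence between config.yaml defaults and --config overrides can be pictured as a dictionary update. The sketch below is a simplified illustration and is not Snakemake's own parser: the value conversion rules and the default values shown are assumptions.

```python
def parse_value(text):
    """Turn 'True'/'False' and numbers into Python values; keep other text as str."""
    if text in ("True", "False"):
        return text == "True"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def apply_overrides(defaults, pairs):
    """Return a copy of `defaults` updated with key=value strings from `pairs`."""
    merged = dict(defaults)
    for pair in pairs:
        key, _, value = pair.partition("=")
        merged[key] = parse_value(value)
    return merged


# Hypothetical defaults; the real ones live in config.yaml.
defaults = {"genome": "hg19", "macs2": False, "macs2_mode": "narrow"}
overrides = ["genome=mm10", "macs2=True", "macs2_mode=broad"]
print(apply_overrides(defaults, overrides))
```

Keys that are not mentioned on the command line keep their config.yaml values, so only the options that differ from the defaults need to be passed.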

QSUB

Configure a Snakemake profile named generic_qsub for qsub with the Torque scheduler:

$ mkdir -p ~/.config/snakemake
$ cd ~/.config/snakemake
$ cookiecutter https://github.com/iromeo/generic.git

Example of ATAC-seq processing on qsub:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all --use-conda --directory <work_dir> \
    --profile generic_qsub --cluster-config qsub_config.yaml --jobs 150 \
    --config fastq_dir=<fastq_dir> genome=<genome> \
    macs2=True macs2_params="-q 0.05 -f BAMPE --nomodel --nolambda -B --call-summits" \
    span=True span_params="--fragment 0" bowtie2_params="-X 2000 --dovetail"  --rerun-incomplete

P.S. Use --config to override the default options from the config.yaml file.

Try with test data

Please download the example fastq.gz files from the CD14_chr15_fastq folder.
These files are filtered to human hg19 chr15 to reduce their size and speed up computations.

Launch chipseq-smk-pipeline:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all --use-conda --cores all --directory <work_dir> \
    --config fastq_ext=fastq.gz fastq_dir=<work_dir> genome=hg19 macs2=True sicer=True span=True \
    --rerun-incomplete

chipseq-smk-pipeline's Issues

Pipeline fails with _1 suffix in file name without corresponding _2 file

user@franklin:/mnt/stripe/shpynov/chipseq-smk-pipeline(master)$ snakemake all --use-conda --cores 28  --directory /mnt/stripe/bio/raw-data/geo-samples/GSE104284 --config fastq_dir=/mnt/stripe/bio/raw-data/geo-samples/GSE104284/fastq genome=mm10 macs2_mode=broad macs2_suffix="broad_0.05" macs2_params="--broad --broad-cutoff 0.05" -n
KeyError in line 20 of /mnt/stripe/shpynov/chipseq-smk-pipeline/rules/sicer.smk:
'GSM2794123_C36UVACXX.1.48hr_01_Left_K27ac_1'
  File "/mnt/stripe/shpynov/chipseq-smk-pipeline/Snakefile", line 55, in <module>
  File "/mnt/stripe/shpynov/chipseq-smk-pipeline/rules/sicer.smk", line 32, in <module>
  File "/mnt/stripe/shpynov/chipseq-smk-pipeline/rules/sicer.smk", line 20, in sicer_all_peaks_input

Workaround used at the moment:

mv /mnt/stripe/bio/raw-data/geo-samples/GSE104284/fastq/GSM2794123_C36UVACXX.1.48hr_01_Left_K27ac_1.fastq /mnt/stripe/bio/raw-data/geo-samples/GSE104284/fastq/GSM2794123_C36UVACXX.1.48hr_01_Left_K27ac.fastq
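
The traceback above does not show why the lookup fails. One plausible reading, stated here as an assumption, is that a trailing _1 is interpreted as the first mate of a paired-end sample. Under that assumption, a pre-flight check like the following sketch could list files whose mate is missing before the pipeline starts; orphan_mates is a hypothetical helper, not part of the pipeline.

```python
def orphan_mates(file_names, fastq_ext):
    """Return names ending in _1.<ext> or _2.<ext> whose mate file is missing."""
    suffix = "." + fastq_ext
    stems = {n[: -len(suffix)] for n in file_names if n.endswith(suffix)}
    orphans = []
    for stem in sorted(stems):
        for mate, other in (("_1", "_2"), ("_2", "_1")):
            # stem[:-2] drops the two-character mate suffix before swapping it
            if stem.endswith(mate) and stem[:-2] + other not in stems:
                orphans.append(stem + suffix)
    return orphans


# Hypothetical file names, used only for illustration.
names = [
    "GSM1_K27ac_1.fastq",
    "GSM2_K27ac_1.fastq",
    "GSM2_K27ac_2.fastq",
    "GSM3_input.fastq",
]
print(orphan_mates(names, "fastq"))
```

Files reported by such a check can then be renamed, as in the workaround above, or reported to the user with a clear message.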

Check that raw data are available on start and warn the user if no reads were found

The following command line fails on franklin:

cd /mnt/stripe/shpynov/chipseq-smk-pipeline
source activate snakemake
snakemake all --use-conda --cores 28  --config work_dir=/mnt/stripe/bio/raw-data/geo-samples/GSE53643 fastq_dir=/mnt/stripe/bio/raw-data/geo-samples/GSE53643/fastq genome=mm10 macs2_params="-q 0.05" span_bin=100 span_fdr=0.05 -n

But fastq_dir contains a lot of fastq files.

Bowtie fails

Activating conda environment: /mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/conda/a30ef292
Traceback (most recent call last):
  File "/mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/scripts/tmpuscfchbw.wrapper.py", line 25, in <module>
    "(bowtie2 --threads {snakemake.threads} {snakemake.params.extra} "
  File "/home/user/anaconda/envs/snakemake/lib/python3.6/site-packages/snakemake/shell.py", line 133, in __new__
Traceback (most recent call last):
  File "/mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/scripts/tmpbjbxtwx7.wrapper.py", line 25, in <module>
    "(bowtie2 --threads {snakemake.threads} {snakemake.params.extra} "
  File "/home/user/anaconda/envs/snakemake/lib/python3.6/site-packages/snakemake/shell.py", line 133, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command ' set -euo pipefail;  (bowtie2 --threads 4  -x bowtie2-index/mm10 -U /mnt/stripe/bio/raw-data/geo-samples/GSE53643/fastq/GSM1297956.fastq | samtools view -Sbh -o bams/GSM1297956.bam.raw -)  > logs/bam_raw/bowtie2/GSM1297956.log 2>&1 ' returned non-zero exit status 1.
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command ' set -euo pipefail;  (bowtie2 --threads 4  -x bowtie2-index/mm10 -U /mnt/stripe/bio/raw-data/geo-samples/GSE53643/fastq/GSM1297966.fastq | samtools view -Sbh -o bams/GSM1297966.bam.raw -)  > logs/bam_raw/bowtie2/GSM1297966.log 2>&1 ' returned non-zero exit status 1.
[Sat Oct 19 16:30:02 2019]
[Sat Oct 19 16:30:02 2019]
Error in rule bowtie2_align_single:
Error in rule bowtie2_align_single:
    jobid: 465
    jobid: 483
    output: bams/GSM1297956.bam.raw
    output: bams/GSM1297966.bam.raw
    log: logs/bam_raw/bowtie2/GSM1297956.log
    log: logs/bam_raw/bowtie2/GSM1297966.log
    conda-env: /mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/conda/a30ef292
    conda-env: /mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/conda/a30ef292


RuleException:
CalledProcessError in line 57 of /mnt/stripe/shpynov/chipseq-smk-pipeline/rules/alignment.smk:
Command 'source activate /mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/conda/a30ef292; set -euo pipefail;  python /mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/scripts/tmpuscfchbw.wrapper.py ' returned non-zero exit status 1.
  File "/mnt/stripe/shpynov/chipseq-smk-pipeline/rules/alignment.smk", line 57, in __rule_bowtie2_align_single
  File "/home/user/anaconda/envs/snakemake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
