jetbrains-research / chipseq-smk-pipeline

ChIP-Seq processing pipeline on snakemake

chipseq-smk-pipeline

A Snakemake-based pipeline for processing ChIP-seq and ATAC-seq datasets, from raw-data QC and alignment to visualization and peak calling.

During peak calling, chipseq-smk-pipeline automatically matches each signal file with a control file by file name similarity.
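
The README does not spell out how name proximity is measured, so the following is only a minimal sketch of one plausible approach, not the pipeline's actual rule. It assumes that control samples can be recognized by `input` or `control` in their names and that the best control is the one sharing the longest common prefix with the signal name. The names `CONTROL_PATTERN`, `common_prefix_length` and `match_controls` are hypothetical.

```python
import os
import re

# Assumption: control samples carry "input" or "control" in their names.
CONTROL_PATTERN = re.compile(r"input|control", re.IGNORECASE)


def common_prefix_length(a, b):
    """Return the length of the longest common prefix of two strings."""
    return len(os.path.commonprefix([a, b]))


def match_controls(sample_names):
    """Map every signal sample to the control sharing the longest name prefix.

    Signals map to None when the dataset contains no control sample.
    """
    controls = sorted(n for n in sample_names if CONTROL_PATTERN.search(n))
    signals = sorted(n for n in sample_names if not CONTROL_PATTERN.search(n))
    matches = {}
    for signal in signals:
        # max() keeps the first of equally good candidates, so ties are
        # resolved alphabetically because the controls are sorted.
        matches[signal] = max(
            controls,
            key=lambda control: common_prefix_length(signal, control),
            default=None,
        )
    return matches


if __name__ == "__main__":
    names = ["CD14_H3K27ac_rep1", "CD14_input", "CD4_H3K27ac_rep1", "CD4_input"]
    for signal, control in match_controls(names).items():
        print(signal, "->", control)
```

With a scheme like this, consistent file naming matters: a signal and its control should share as long a leading part of the name as possible.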

Input

Input FASTQ files

The pipeline aligns FASTQ or gzipped FASTQ reads, as configured in config.yaml.
The reads folder is set by the fastq_dir property, a path relative to the pipeline working directory.
The FASTQ file extension is set by the fastq_ext property, e.g. fq, fq.gz, fastq, or fastq.gz.
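
To illustrate how fastq_ext determines which files count as reads, here is a small sketch that derives sample names from file names. It is an assumption about how such a lookup could work, not code taken from the pipeline; `sample_names` is a hypothetical helper.

```python
def sample_names(file_names, fastq_ext):
    """Map sample name -> file name for files ending in '.<fastq_ext>'.

    The sample name is the file name with the configured extension removed,
    so any other dots in the name are preserved.
    """
    suffix = "." + fastq_ext
    return {
        name[: -len(suffix)]: name
        for name in sorted(file_names)
        if name.endswith(suffix) and len(name) > len(suffix)
    }


if __name__ == "__main__":
    files = ["a.fastq.gz", "b.fastq", "notes.txt"]
    # Only the file matching the configured extension is picked up.
    print(sample_names(files, "fastq.gz"))
```

Under this reading, a folder mixing `.fastq` and `.fastq.gz` files yields only the files whose extension matches fastq_ext exactly.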

Input BAM files

Use the start_with_bams=True config option to start with existing BAM files.
In this case the pipeline reads BAM files from the work_dir/bams folder.

Files

Path	Description
`config.yaml`	Default pipeline options
`trimmed`	Trimmed FASTQ files, if the `trim_reads` option is True
`bams`	BAMs with aligned reads, `MAPQ >= 30`
`bw`	BAM coverage visualization using DeepTools
`macs2`	MACS2 peaks
`sicer`	SICER peaks
`span`	SPAN peaks
`qc`	QC Reports
`multiqc`	MultiQC reports for different steps
`logs`	Shell command logs

Requirements

The pipeline requires conda.

If conda is not installed, follow the instructions at the Conda website.
Navigate to the repository directory.

Create a Conda environment for snakemake:

$ conda env create --file environment.yaml --name snakemake

Activate the newly created environment:

$ source activate snakemake

On Ubuntu please ensure that gawk is installed:

$ sudo apt-get install gawk

Launch

Run the pipeline starting from FASTQ reads:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all [--cores <cores>] --use-conda --directory <work_dir> \
    --config fastq_dir=<fastq_dir> genome=<genome> --rerun-incomplete

By default, the pipeline doesn't launch peak callers. Add macs2=True, sicer=True, or span=True to the --config options to call peaks with MACS2, SICER, or SPAN, respectively.

To launch MACS2 in --broad mode, use the following config:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all [--cores <cores>] --use-conda --directory <work_dir> \
    --config fastq_dir=<fastq_dir> genome=<genome> \
    macs2=True macs2_mode=broad macs2_params="--broad --broad-cutoff 0.1" macs2_suffix=broad0.1 \
    --rerun-incomplete

See config.yaml for the complete list of parameters. Use --config to override default options from the config.yaml file.

QSUB

Configure a Snakemake profile named generic_qsub for qsub with the Torque scheduler:

$ mkdir -p ~/.config/snakemake
$ cd ~/.config/snakemake
$ cookiecutter https://github.com/iromeo/generic.git

Example of ATAC-Seq processing on qsub:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all --use-conda --directory <work_dir> \
    --profile generic_qsub --cluster-config qsub_config.yaml --jobs 150 \
    --config fastq_dir=<fastq_dir> genome=<genome> \
    macs2=True macs2_params="-q 0.05 -f BAMPE --nomodel --nolambda -B --call-summits" \
    span=True span_params="--fragment 0" bowtie2_params="-X 2000 --dovetail"  --rerun-incomplete

P.S. Use --config to override default options from the config.yaml file.

Try with test data

Download the example fastq.gz files from the CD14_chr15_fastq folder.
These files are restricted to reads from human hg19 chr15 to reduce their size and speed up computation.

Launch chipseq-smk-pipeline:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all --use-conda --cores all --directory <work_dir> \
    --config fastq_ext=fastq.gz fastq_dir=<work_dir> genome=hg19 macs2=True sicer=True span=True \
    --rerun-incomplete

Useful links

Learn more about the Snakemake workflow management system
Developed with SnakeCharm plugin for PyCharm IDE by JetBrains Research BioLabs

user@franklin:/mnt/stripe/shpynov/chipseq-smk-pipeline(master)$ snakemake all --use-conda --cores 28  --directory /mnt/stripe/bio/raw-data/geo-samples/GSE104284 --config fastq_dir=/mnt/stripe/bio/raw-data/geo-samples/GSE104284/fastq genome=mm10 macs2_mode=broad macs2_suffix="broad_0.05" macs2_params="--broad --broad-cutoff 0.05" -n
KeyError in line 20 of /mnt/stripe/shpynov/chipseq-smk-pipeline/rules/sicer.smk:
'GSM2794123_C36UVACXX.1.48hr_01_Left_K27ac_1'
  File "/mnt/stripe/shpynov/chipseq-smk-pipeline/Snakefile", line 55, in <module>
  File "/mnt/stripe/shpynov/chipseq-smk-pipeline/rules/sicer.smk", line 32, in <module>
  File "/mnt/stripe/shpynov/chipseq-smk-pipeline/rules/sicer.smk", line 20, in sicer_all_peaks_input

Workaround used at the moment:

mv /mnt/stripe/bio/raw-data/geo-samples/GSE104284/fastq/GSM2794123_C36UVACXX.1.48hr_01_Left_K27ac_1.fastq /mnt/stripe/bio/raw-data/geo-samples/GSE104284/fastq/GSM2794123_C36UVACXX.1.48hr_01_Left_K27ac.fastq

Check that raw data are available on start and warn the user if no reads were found

The following command line fails on franklin:

cd /mnt/stripe/shpynov/chipseq-smk-pipeline
source activate snakemake
snakemake all --use-conda --cores 28  --config work_dir=/mnt/stripe/bio/raw-data/geo-samples/GSE53643 fastq_dir=/mnt/stripe/bio/raw-data/geo-samples/GSE53643/fastq genome=mm10 macs2_params="-q 0.05" span_bin=100 span_fdr=0.05 -n

But fastq_dir contains a lot of fastq files.
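
The requested start-up check could look roughly like the sketch below. It is not the pipeline's implementation, and the report does not say why the reads were not picked up; the sketch simply assumes that reads are recognized by the configured fastq_ext extension. The helper names `missing_reads_warning` and `warn_if_no_reads` are hypothetical.

```python
import os
import sys


def missing_reads_warning(file_names, fastq_dir, fastq_ext):
    """Return a warning message when no file ends in '.<fastq_ext>', else None."""
    suffix = "." + fastq_ext
    if any(name.endswith(suffix) for name in file_names):
        return None
    return (
        f"No reads with extension '{fastq_ext}' found in {fastq_dir} "
        f"({len(file_names)} other file(s) present); "
        "check the fastq_dir and fastq_ext options."
    )


def warn_if_no_reads(fastq_dir, fastq_ext):
    """Print the warning to stderr; return True when reads are available."""
    # A missing directory is treated the same as an empty one.
    file_names = os.listdir(fastq_dir) if os.path.isdir(fastq_dir) else []
    message = missing_reads_warning(file_names, fastq_dir, fastq_ext)
    if message is not None:
        print("WARNING: " + message, file=sys.stderr)
    return message is None
```

Calling `warn_if_no_reads(config["fastq_dir"], config["fastq_ext"])` near the top of the Snakefile would surface the problem before any rule is scheduled, assuming both keys are present in the config.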

Allow running the pipeline from the working directory (with the -s parameter), not only from the source directory

Bowtie fails

Activating conda environment: /mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/conda/a30ef292
Traceback (most recent call last):
  File "/mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/scripts/tmpuscfchbw.wrapper.py", line 25, in <module>
    "(bowtie2 --threads {snakemake.threads} {snakemake.params.extra} "
  File "/home/user/anaconda/envs/snakemake/lib/python3.6/site-packages/snakemake/shell.py", line 133, in __new__
Traceback (most recent call last):
  File "/mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/scripts/tmpbjbxtwx7.wrapper.py", line 25, in <module>
    "(bowtie2 --threads {snakemake.threads} {snakemake.params.extra} "
  File "/home/user/anaconda/envs/snakemake/lib/python3.6/site-packages/snakemake/shell.py", line 133, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command ' set -euo pipefail;  (bowtie2 --threads 4  -x bowtie2-index/mm10 -U /mnt/stripe/bio/raw-data/geo-samples/GSE53643/fastq/GSM1297956.fastq | samtools view -Sbh -o bams/GSM1297956.bam.raw -)  > logs/bam_raw/bowtie2/GSM1297956.log 2>&1 ' returned non-zero exit status 1.
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command ' set -euo pipefail;  (bowtie2 --threads 4  -x bowtie2-index/mm10 -U /mnt/stripe/bio/raw-data/geo-samples/GSE53643/fastq/GSM1297966.fastq | samtools view -Sbh -o bams/GSM1297966.bam.raw -)  > logs/bam_raw/bowtie2/GSM1297966.log 2>&1 ' returned non-zero exit status 1.
[Sat Oct 19 16:30:02 2019]
[Sat Oct 19 16:30:02 2019]
Error in rule bowtie2_align_single:
Error in rule bowtie2_align_single:
    jobid: 465
    jobid: 483
    output: bams/GSM1297956.bam.raw
    output: bams/GSM1297966.bam.raw
    log: logs/bam_raw/bowtie2/GSM1297956.log
    log: logs/bam_raw/bowtie2/GSM1297966.log
    conda-env: /mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/conda/a30ef292
    conda-env: /mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/conda/a30ef292


RuleException:
CalledProcessError in line 57 of /mnt/stripe/shpynov/chipseq-smk-pipeline/rules/alignment.smk:
Command 'source activate /mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/conda/a30ef292; set -euo pipefail;  python /mnt/stripe/bio/raw-data/geo-samples/GSE53643/.snakemake/scripts/tmpuscfchbw.wrapper.py ' returned non-zero exit status 1.
  File "/mnt/stripe/shpynov/chipseq-smk-pipeline/rules/alignment.smk", line 57, in __rule_bowtie2_align_single
  File "/home/user/anaconda/envs/snakemake/lib/python3.6/concurrent/futures/thread.py", line 56, in run