Coder Social home page Coder Social logo

nf-core / methylseq Goto Github PK

View Code? Open in Web Editor NEW
137.0 143.0 133.0 20.49 MB

Methylation (Bisulfite-Sequencing) analysis pipeline using Bismark or bwa-meth + MethylDackel

Home Page: https://nf-co.re/methylseq

License: MIT License

HTML 1.49% Python 5.65% Nextflow 74.49% Groovy 16.76% TeX 1.61%
nf-core nextflow workflow bisulfite-sequencing dna-methylation methyl-seq pipeline em-seq epigenome epigenomics

methylseq's Introduction

nf-core/methylseq nf-core/methylseq

GitHub Actions CI Status GitHub Actions Linting StatusAWS CICite with Zenodo

Nextflow run with conda run with docker run with singularity Launch on Nextflow Tower nf-test

Get help on SlackFollow on TwitterFollow on MastodonWatch on YouTube

Introduction

nf-core/methylseq is a bioinformatics analysis pipeline used for Methylation (Bisulfite) sequencing data. It pre-processes raw data from FastQ inputs, aligns the reads and performs extensive quality-control on the results.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker / Singularity containers making installation trivial and results highly reproducible.

On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources.The results obtained from the full-sized test can be viewed on the nf-core website.

Pipeline Summary

The pipeline allows you to choose between running either Bismark or bwa-meth / MethylDackel. Choose between workflows by using --aligner bismark (default, uses bowtie2 for alignment), --aligner bismark_hisat or --aligner bwameth.

Step Bismark workflow bwa-meth workflow
Generate Reference Genome Index (optional) Bismark bwa-meth
Merge re-sequenced FastQ files cat cat
Raw data QC FastQC FastQC
Adapter sequence trimming Trim Galore! Trim Galore!
Align Reads Bismark bwa-meth
Deduplicate Alignments Bismark Picard MarkDuplicates
Extract methylation calls Bismark MethylDackel
Sample report Bismark -
Summary Report Bismark -
Alignment QC Qualimap Qualimap
Sample complexity Preseq Preseq
Project Report MultiQC MultiQC

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set-up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

sample,fastq_1,fastq_2
SRR389222_sub1,https://github.com/nf-core/test-datasets/raw/methylseq/testdata/SRR389222_sub1.fastq.gz
SRR389222_sub2,https://github.com/nf-core/test-datasets/raw/methylseq/testdata/SRR389222_sub2.fastq.gz
SRR389222_sub2,https://github.com/nf-core/test-datasets/raw/methylseq/testdata/SRR389222_sub3.fastq.gz
Ecoli_10K_methylated,https://github.com/nf-core/test-datasets/raw/methylseq/testdata/Ecoli_10K_methylated_R1.fastq.gz,https://github.com/nf-core/test-datasets/raw/methylseq/testdata/Ecoli_10K_methylated_R2.fastq.gz

Each row represents a fastq file (single-end) or a pair of fastq files (paired end).

Now, you can run the pipeline using:

nextflow run nf-core/methylseq --input samplesheet.csv --outdir <OUTDIR> --genome GRCh37 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pipeline output

To see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

Credits

These scripts were originally written for use at the National Genomics Infrastructure at SciLifeLab in Stockholm, Sweden.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #methylseq channel (you can join with this invite).

Citations

If you use nf-core/methylseq for your analysis, please cite it using the following doi: 10.5281/zenodo.1343417

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

methylseq's People

Contributors

alesssia avatar alneberg avatar apeltzer avatar colindaven avatar drpatelh avatar edmundmiller avatar ewels avatar felixkrueger avatar gdevailly avatar hammarn avatar kevinmenden avatar ltosti-tagomics avatar mashehu avatar maxulysse avatar nf-core-bot avatar njspix avatar noecochetel avatar noirot avatar nvk747 avatar pditommaso avatar phue avatar robsyme avatar sateeshperi avatar spikyclip avatar sppearce avatar sven1103 avatar wassimsalam01 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

methylseq's Issues

Make fastqc time longer

Hi,

The time for fastqc is not enough...
I'm getting the following error:

Error executing process > 'fastqc (TBR532u)'

Caused by:
  Process exceeded running time limit (2h)

Command executed:

  fastqc --quiet --threads 1 TBR532u.R1.fq.gz TBR532u.R2.fq.gz

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /path/to/work/68/387258828eecd198130d8230217c21

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

Can you please give it longer running time? Or give it more than one thread?

Thanks!

samtools sort requests too much memory

All previous process run without error.
I noticed that at each retry number of threads and memory is increased.
Any workaround is accepted.
have you thought of using biobambam2 which seems more efficient.
https://gitlab.com/german.tischler/biobambam2

[02/b252cc] Submitted process > qualimap (30jan18_mysample)
[97/8473b0] Submitted process > bismark_methXtract (30jan18_mysample)
[02/b252cc] NOTE: Process `qualimap (30jan18_mysample)` terminated with an error exit status (137) -- Execution is retried (1)
[be/827d8a] Re-submitted process > qualimap (30jan18_mysample)
[be/827d8a] NOTE: Process `qualimap (30jan18_mysample)` terminated with an error exit status (137) -- Execution is retried (2)
[72/1d29bf] Re-submitted process > qualimap (30jan18_mysample)
[72/1d29bf] NOTE: Process `qualimap (30jan18_mysample)` terminated with an error exit status (137) -- Execution is retried (3)
[34/099bd8] Re-submitted process > qualimap (30jan18_mysample)
ERROR ~ Error executing process > 'qualimap (30jan18_mysample)'

Caused by:
  Process `qualimap (30jan18_mysample)` terminated with an error exit status (137)

Command executed:

  samtools sort 30jan18_mysample#1_R1_val_1_bismark_bt2_pe.deduplicated.bam \
      -m 8589934592 \
      -@ 16 \
      -o 30jan18_mysample#1_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam
  qualimap bamqc  \
      -bam 30jan18_mysample#1_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \
      -outdir 30jan18_mysample#1_R1_val_1_bismark_bt2_pe.deduplicated_qualimap \
      --collect-overlap-pairs \
      --java-mem-size=128G \
      -nt 16

Command exit status:
  137

Command output:
  (empty)

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  .command.sh: line 5:    21 Killed                  samtools sort 30jan18_mysample#1_R1_val_1_bismark_bt2_pe.deduplicated.bam -m 8589934592 -@ 16 -o 30jan18_mysample#1_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam

Work dir:
  /tmp/wr_cwd/8/0/3/83ffa1ab79fa53d119320d4e6e069870891302/cwd/work/34/099bd83317d4c063aa0bdde9c49929

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
[nf-core/methylseq] Pipeline Complete
WARN: Killing pending tasks (1)

I opened a PR :)

Just bringing to your attention that I opened a PR, adding the biscuit aligner here..

Good day to you!

Move the test data into a new repository and set up clones

Instead of bundling the test data in for every person who ever runs the pipeline, move to a branch of a separate repo and tag a subrepository. Then when developing and testing we can easily clone recursively to get the data, but it won't be fetched for everyone else.

Workflow runs in my home but not in scratch

Hello,

I am attempting to replace some archaic make based workflows with nf-core/methylseq. I am pretty green with nextflow but also pretty sure something funny is going on here. I am just trying to run the test profile with

nextflow run nf-core/methylseq -profile test,conda

and it runs fine in my home directory with minor errors at preseq process (?). But when I try to do the same exact thing from my scratch dir (another filesystem) I get this error:

(Note that I first got this error with singularity, and tried to set singularity.autoMounts = true but no dice, then I noticed I am getting the same exact error with conda. Also, I use singularity with custom binds on this cluster all the time. )

I have no idea where to go from here.

Error executing process > 'output_documentation'

Caused by:
  Process `output_documentation` terminated with an error exit status (127)

Command executed:

  markdown_to_html.r output.md results_description.html

Command exit status:
  127

Command output:
  (empty)

Command error:
  .command.sh: line 2: markdown_to_html.r: command not found

Add option to change max number of cpus, memory etc.

Running the pipeline (using bwa-meth aligner), I get the following message:

Error executing process > 'qualimap (248933-cf-M.ben_list.b37)'

Caused by:
  Process requirement exceed available CPUs -- req: 12; avail: 10

Can you add an option to specify number of maximum cpus to use?

Max CPU utilization

Hello,
whatever the configuration (even when I set max_cpus 20) bismark will run on 1 core. How can I change that?

implementing bam_merge step for multiple split bam's for a given sample

Hi Felix,

I have implemented additional step which will merge multiple bams for a sample after alignment. e.g.:

  • Sample A: split_0_R1_fastq.gz, split_0_R2_fastq.gz
  • Sample A: split_1_R1_fastq.gz, split_1_R2_fastq.gz
  • Sample B: split_0_R1_fastq.gz, split_0_R2_fastq.gz
  • Sample B: split_1_R1_fastq.gz, split_1_R2_fastq.gz

Below is snippet to handle this.

   /*
     * STEP 3.2 - merge split bams
     */

     if( params.merge ) {  
        sample_name = params.name
        process merge_bams {
            tag "$name"
            publishDir "${params.outdir}/bismark_alignments", mode: 'copy',
                saveAs: {filename ->
                                    if (filename.indexOf(".merged.bam") > 0) filename
                                    else null
                        }

            input:
            file(bams) from ch_bam_for_merge.collect()

            output:
            set val(sample_name), file("*.merged.bam") into ch_bam_for_bismark_deduplicate

            script:
            """
            samtools merge -n ${sample_name}.merged.bam ${bams}
            """
        }
     }   
     else{
         ch_bam_for_merge_pass.into{ch_bam_for_bismark_deduplicate}
     }

I will share the forked branch once did some tidy up.

Thanks,
Shriram

Insert size histogram in multiqc

Distribution of estimated insert sizes of mapped reads looks wrong on multiqc but ok in Qualimap Report: BAM QC (Insert Size Histogram). After normalization by mutiqc (fraction of reads) there is this artificial huge peak in one region.

Rewrite the documentation

After all of the recent refactoring work, much of the documentation is now plain wrong. Much was also missing for the bwa-meth aligner from day one.

Run methylseq with singularity fails!

Hi,

Sorry for asking so many questions, I must succeed with running this pipeline!
I'm running with sungularity, using the test profile:
nextflow -log methylseq_try run nf-core/methylseq -profile singularity,test

and after some text, I'm getting the following error:

Error executing process > 'fastqc (SRR389222_sub1)'

Caused by:
  Process `fastqc (SRR389222_sub1)` terminated with an error exit status (255)

Command executed:

  fastqc --quiet --threads 1 SRR389222_sub1.fastq.gz

Command exit status:
  255

Command output:
  (empty)

Command error:
  ERROR  : Image path /my/home/path/work/singularity/nfcore-methylseq-1.4.img doesn't exist: No such file or directory
  ABORT  : Retval = 255

Running with the same dockerfile, but as a local dockerfile and with local data, gives out:

Error executing process > 'fastqc (149122-cf-M.ben_list.b37)'

Caused by:
  Process `fastqc (149122-cf-M.ben_list.b37)` terminated with an error exit status (255)

Command executed:

  fastqc --quiet --threads 1 149122-cf-M.ben_list.b37.sorted_1.fastq 149122-cf-M.ben_list.b37.sorted_2.fastq

Command exit status:
  255

Command output:
  (empty)

Command error:
  ERROR  : Unknown image format/type: /my/home/path/methylseq/Dockerfile
  ABORT  : Retval = 255
  

What is that?
Tell me if you need the log file.
I'm using nextflow version 19.10.0 build 5170 and singularity 2.6.0-dist

Thanks!

Fix the readme badges

They need some love...

  • Minimum nextflow version is wrong
  • Docker badge
  • Bioconda badge

Define conda channel names in environment file

Docker builds are failing because the conda solving environment takes too long. Could be worth specifying the channel for each package to try to speed this up.

eg. bioconda::multiqc=1.7 instead of just multiqc=1.7

Bismark methXtract to use n/3 cpus instead of n/10

Hi Phil,

Thank you for this pipeline it is extremely helpful and has been a huge help so far in my PhD project!

I think the multicore calculation for bismark Methylation extraction is incorrect in the main.nf line 645.

The script defines each multicore as no.CPUs /10 for the methylation extraction however looking at the Bismark documentation this should be no.CPUs/3:

--multicore <int> Sets the number of cores to be used for the methylation extraction process.

If system resources are plentiful this is a viable option to speed up the extraction process (we
observed a near linear speed increase for up to 10 cores used). Please note that a typical process of extracting a BAM file and writing out '.gz' output streams will in fact use ~3 cores per value of --multicore <int> specified (1 for the methylation extractor itself, 1 for a Samtools stream, 1 for GZIP stream), so --multicore 10 is likely to use around 30 cores of system resources. This option has no bearing on the speed of the bismark2bedGraph or genome-wide cytosine report processes

I think this explains why despite feeding the job 24 CPUs and 128GB RAM and specifying this in the config file only 2 multicores were used. Looking at the pipeline report only 4CPUs and 23GB RAM were utilised for the process and it took over 23hours to run on a human sample with 72 million paired reads. In contrast the Bismark alignment process finished in 9 hours and used 12 CPUs and 80GB RAM (hg37d5 as reference) running with 8 multicores.

Actually on this note, i wonder whether the smaller human reference genome and the figures above suggest a different cpu_per-multicore strategy might be possible e.g. 2 cpus per multicore for smaller genomes?

thanks

Joe

Reference params set before iGenomes config is loaded

I think in the nextflow.config file, in line
bwa_meth_index = params.genome ? params.genomes[ params.genome ].bwa_meth ?: false : false

Doesn't it have to be
bwa_meth_index = params.genome ? params.genomes[ params.genome ].bwa ?: false : false

Because in the igenome config, there is no field name bwa_meth, and there is a field bwa

Get the quay.io docker image working

Quay.io currently won't allow me to set up automated builds, giving me a HTML 500 error each time I try. Need to come back to this in a few days and try again.

specify available memory to Picard

Hello,

Thank you for this fantastic workflow. I have found it extremely useful.

I had an issue when running larger .fastq files (64Gb each R1 and R2) through the bwa-mem workflow in which java would throw an error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

After some reading I found this was related to the java heap space.

I was able to get around this by increasing the heap space within the main.nf file (prior to markDuplicate command) using the following command export _JAVA_OPTIONS="-Xmx200g".

Thought I would leave this here in case others had the same issue. This is a solution, not sure if it is the best.

Cheers,

Zac

Support for multiple samples

It would be a good idea to support multiple samples so the user does not need to run the same command many times. I have 40 samples to analyse now and I have to run the command 40 times

Running methyseq with bwa-meth: methyldackel run only on one sample

I'm trying to run the pipeline using "bwameth" aligner.

The command is this:

nextflow -log methylseq_bwameth run -r dev -w work_try  nf-core/methylseq -with-singularity /tmp/singularity/methylseq_1.1.img -profile singularity --aligner bwameth  --reads 'data/run/fastq_files/links/*sorted_{1,2}.fastq' --name try_biscuit_pipeline --save_reference --genome hg38 --fasta  Genomes/hg38/bwa-meth/hg38.fa  --fasta_index data/Genomes/hg38/hg38.fa.fai --bwa_meth_index Genomes/hg38/bwa-meth/hg38.fa  --max_cpus 8 --skip_trimming --max_memory '60GB'

Everything works fine, until the methyldackel step. It's working only on one sample!
The end of the log is here:


executor >  local (22)
[ba/f84f0d] process > get_software_versions          [100%] 1 of 1 ✔
[48/040213] process > fastqc (149122-exo-X-60min.... [100%] 3 of 3 ✔
[31/e89b79] process > bwamem_align (149122-exo-X-... [100%] 3 of 3 ✔
[f5/30f105] process > samtools_sort_index_flagsta... [100%] 3 of 3 ✔
[d3/b6e063] process > markDuplicates (149122-exo-... [100%] 3 of 3 ✔
[07/8b5889] process > methyldackel (248933-cf-M.b... [100%] 1 of 1 ✔
[18/bac7a5] process > qualimap (149122-exo-X-60mi... [100%] 3 of 3 ✔
[85/1d3205] process > preseq (149122-exo-X-60min.... [100%] 3 of 3 ✔
[cc/a4a896] process > multiqc                        [100%] 1 of 1 ✔
[78/9bb626] process > output_documentation           [100%] 1 of 1 ✔
[0;35m[nf-core/methylseq] Pipeline completed successfully
Completed at: 22-Dec-2019 15:04:39
Duration    : 6m 40s
CPU hours   : 0.7
Succeeded   : 22

From some attempts I made, I think the problem is because the methyldackel step get a fasta as input, and somehow it makes him running only once.

Can someone help me?

No fasta file when running the pipeline in local computer

Hello,

I'm trying to run the pipeline after cloning to my local computer:

nextflow -log methylseq_log run methylseq/main.nf --aligner 'bwameth' --reads 'data/run_bwameth/fastq_files/*/*sorted_{1,2}.fastq' -profile singularity --genome 'GRCh37' --outdir 'data/run_bwameth/nextflow' --email [email protected] --email_on_fail [email protected] --name try_pipeline

and I get the error No Fasta reference specified!
The fasta parameter is actually false, but why? printing ${params.genomes[params.genome].fasta} gives me the right fasta file s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa.
So why does the fasta parameter get false value?

The config file, after all defaults jas this lines:

Nov-25 13:07:50.844 [main] DEBUG nextflow.Session - Session aborted -- Cause: No Fasta reference specified! false, s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa. Expression: params.fasta
Nov-25 13:07:50.866 [main] ERROR nextflow.cli.Launcher - @unknown
java.lang.AssertionError: No Fasta reference specified!  Expression: params.fasta
        at org.codehaus.groovy.runtime.InvokerHelper.assertFailed(InvokerHelper.java:417)
        at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.assertFailed(ScriptBytecodeAdapter.java:670)
        at Script_372c9fdc.runScript(Script_372c9fdc:134)
        at nextflow.script.BaseScript.run0(BaseScript.groovy:152)
        at nextflow.script.BaseScript.run(BaseScript.groovy:182)
        at nextflow.script.ScriptParser.runScript(ScriptParser.groovy:217)
        at nextflow.script.ScriptRunner.run(ScriptRunner.groovy:218)
        at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:126)
        at nextflow.cli.CmdRun.run(CmdRun.groovy:257)
        at nextflow.cli.Launcher.run(Launcher.groovy:457)
        at nextflow.cli.Launcher.main(Launcher.groovy:639)

--cytosine_report for bismark_methXtract

Hello,

for downstream analysis I need the strand of each cytosine. To get it I would like to add --cytosine_report in comand line bismark_methylation_extractor and to save outputs *.CpG_report.txt.gz in results/bismark_methylation_calls/stranded_CpG_report

If you are ok with this enhancement I can develop it and make a PR ?!
Céline

Batch run for all samples in data dir

I am using nf-core-methylseq-1.2, with conda environment installed from environment.yml, including java 8 in that environment. Conda Source environment is activated on local HPC.

  1. Error in calling the software versions used
[69/77bf6d] Submitted process > get_software_versions
[69/77bf6d] NOTE: Missing output file(s) `software_versions_mqc.yaml` expected by process `get_software_versions` -- Error is ignored

Pipeline is still running so not big problem but would be Nice if it reports versions ….

  1. under directory data, there are many samples. The process is done only for 1 sample (R1 and R2). Why not for all? Exception for that is fastqc for which all samples are done.
nextflow run /home/user/methylseq-1.2/ -resume --reads "/WORKING/projects/DNAm_project/genome/data/ZTRX_accel1S_S13_R{1,2}_*.fastq" --outdir . --fasta "references/Homo_sapiens/UCSC/hg38/Sequence/" --bismark_index "references/Homo_sapiens/UCSC/hg38/Sequence/BismarkIndex/" --accel --saveTrimmed  --saveAlignedIntermediates --max_memory "128.GB" --max_cpu "8" --max_time "240.h"

Thanks

Nondeterministic execution of bismark_report task

Trying this pipeline I've noticed that the input definition for the process bismark_report is seriously flawed:

input:
    file bismark_align_log_1
    file bismark_dedup_log_1
    file bismark_splitting_report_1
    file bismark_mbias_1

It gets inputs from:

  • bismark_align
  • bismark_deduplicate
  • bismark_methXtract
  • bismark_methXtract

Each of the above processes have an nondeterministic execution order (due to parallel execution) and there's no guarantees that the inputs are received in the same order for different processes. This means that bismark_report is receiving a messed input composition.

The output should include a read name component (instead of extracting it from the file name!) and use it to recompose the input channel for the bismark_report using the join operator.

Add --methylKit flag to MethylDackel

Hi, I am trying to use -profile conda on HPC with following code 1) and error 2)
The conda env is downloaded automatically to work/conda and runs first few processes for all samples after which it crashes. Thanks

./nextflow run /rds/homes/w/wm/methylseq-1.3/ -resume  -profile conda \
  --reads '/rds/projects/2016/mw/Epi/epidata/170310_D00261_0393_BCAJU6ANXX_8_IL-TP-*{1,2}.fastq.gz' \
  --outdir . \
  --aligner 'bwameth' \
  --fasta '/rds/reference/pa42_assembly_v3.fasta' \
  --zymo --saveTrimmed --saveAlignedIntermediates --unmapped --comprehensive \
  --max_memory 65.GB --memory 30.GB --max_cpus 5 \

Error is following:

[warm up] executor > local
[07/4f4158] Cached process > get_software_versions
[27/c34b8e] Cached process > makeFastaIndex (pa42_assembly_v3.fasta)
[89/339d82] Cached process > fastqc (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-009)
[41/4dd6bc] Cached process > fastqc (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-006)
[6d/7e7171] Cached process > makeBwaMemIndex (pa42_assembly_v3.fasta)
[d9/3467b2] Cached process > fastqc (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-003)
[51/7c1f53] Cached process > fastqc (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-005)
[b1/542310] Cached process > fastqc (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-012)
[f1/c714af] Cached process > fastqc (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-008)
[67/13b7cc] Cached process > fastqc (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-002)
[87/b332d8] Cached process > fastqc (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-001)
[42/77a29f] Cached process > fastqc (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-004)
[3a/33df5f] Cached process > fastqc (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-007)
[d5/34d865] Cached process > trim_galore (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-004)
[d6/009478] Cached process > trim_galore (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-005)
[69/f3f511] Cached process > trim_galore (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-012)
[75/175365] Cached process > trim_galore (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-007)
[dd/0d9c46] Cached process > trim_galore (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-006)
[82/b801e0] Cached process > trim_galore (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-008)
[da/8889a9] Cached process > trim_galore (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-001)
[86/7bf673] Cached process > trim_galore (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-002)
[4f/c3192a] Cached process > bwamem_align (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-008)
ERROR ~ Error executing process > 'samtools_sort_index_flagstat (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-008)'

Caused by:
  Cannot invoke method toBytes() on null object

Source block:
  """
  samtools sort $bam \\
      -m ${task.memory.toBytes() / task.cpus} \\
      -@ ${task.cpus} \\
      -o ${bam.baseName}.sorted.bam
  samtools index ${bam.baseName}.sorted.bam
  samtools flagstat ${bam.baseName}.sorted.bam > ${bam.baseName}_flagstat_report.txt
  samtools stats ${bam.baseName}.sorted.bam > ${bam.baseName}_stats_report.txt
  """

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details
[be/e133bb] Submitted process > bwamem_align (170310_D00261_0393_BCAJU6ANXX_8_IL-TP-006)
[nf-core/methylseq] Pipeline Complete
WARN: Killing pending tasks (1)
WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.

Make a where_are_my_files.txt

In the rnaseq pipeline we save a text file called where_are_my_files.txt when the intermediate BAM files etc are not saved to results. Would be good to do that here too.

Make bismark align mem/cpu_per_multicore configurable

It would be nice to have a params.bismark_align_cpu_per_multicore and params.bismark_align_mem_per_multicore and use these if set instead of trying to use sensible default. Then the user can decide on the best setup for their use case if they want to, and squeeze more performance out of their system.

See #116 (comment)

Add preseq

It would be nice to add in preseq as a standard step to assess library complexity.

We already have this in the rnaseq pipeline, so it should be pretty easy to add.

Bismark hisat2 support

Hello,
Could you update the pipeline with the supported hisat2 aligner in the last version of bismark?


[Edit]: to-do list added by @ewels - head comment to give progress bar in issues

  • bismark genome preparation

    • Add --hisat option
  • bismark

    • Add --hisat pipeline option
    • Add --no-spliced-aligments bismark flag by default if using --hisat
      skimming through the bismark code, it seems that this is actually set by default
    • Add --spliced-aligments pipeline option to remove the above option when using hisat.
    • Add --known-splicesite-infile <path> pipeline option
  • Add hisat2 to the conda environment

  • Add new channels to handle hisat2 reference genome indexes (including splice sites if given)

  • Make all of this work with new pipeline params

    • --aligner bismark_hisat would be my suggestion
    • --spliced-aligments and --known-splicesite-infile
  • Logging and documentation

  • update bismark bioconda recipe to v0.21.0

  • change travis file to cover hisat2 alignment

Complexity curve

  1. Looking into complexity curve. Could it be that for PE reads we actually need to add the -P parameter in preseq? This was not detected in the pipeline automatically.
samtools sort 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.bam \
         -m 8589934592 \
         -@ 1 \
         -o 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.sorted.bam
     preseq lc_extrap -v -B 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.sorted.bam -o 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.ccurve.txt 
  1. Also, for very small amounts of the reads this program is not working, but I guess that is my problem with reads of total number 119,573. Error: max count before zero is less than min required cound (4), sample not sufficiently deep or duplicates removed.

original command:

BAM_INPUT
TOTAL READS     = 119573
DISTINCT READS  = 119456
DISTINCT COUNTS = 3
MAX COUNT       = 3
COUNTS OF 1     = 119342
MAX TERMS       = 2
OBSERVED COUNTS (4)
1       119342
2       111
3       3

ERROR:  max count before zero is les than min required count (4), sample not sufficiently deep or duplicates removed

and if I implement PE:

PAIRED_END_BAM_INPUT
paired = 119572
unpaired = 0
MERGED PAIRED END READS = 119572
MATES PROCESSED = 239144
TOTAL READS     = 119572
DISTINCT READS  = 119485
DISTINCT COUNTS = 2
MAX COUNT       = 2
COUNTS OF 1     = 119398
MAX TERMS       = 2
OBSERVED COUNTS (3)
1       119398
2       87

ERROR:  max count before zero is les than min required count (4), sample not sufficiently deep or duplicates removed

Update badge to travis-ci.com

We've been migrating this over to travis-ci.com (as all travis-ci.org repositories will be shutdown at some point). Please update the badge in the readme accordingly :-)

Add SNP calling

It would be nice to be able to have the option of calling variants from bisulfite data.

It shouldn't be too tricky to add Bis-SNP or something similar as a new opt-in process. There may be other / better tools also?

MultiQC saving files to (unreachable)/(unreachable)

Hello again,

I tried to run the methylseq pipeline, using docker:
nextflow run nf-core/methylseq -profile test,docker

I got the following error:

[0;35m[nf-core/methylseq] Pipeline completed with errors
Error executing process > 'multiqc'

Caused by:
  Missing output file(s) `*multiqc_report.html` expected by process `multiqc`

Command executed:

  multiqc -f   --config multiqc_config.yaml . \
      -m custom_content -m picard -m qualimap -m bismark -m samtools -m preseq -m cutadapt -m fastqc

Command exit status:
  0

Command output:
  (empty)

Command error:
  /opt/conda/envs/nf-core-methylseq-1.4/lib/python2.7/site-packages/multiqc/utils/config.py:45: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
    configs = yaml.load(f)
  /opt/conda/envs/nf-core-methylseq-1.4/lib/python2.7/site-packages/multiqc/utils/config.py:51: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
    sp = yaml.load(f)
  /opt/conda/envs/nf-core-methylseq-1.4/lib/python2.7/site-packages/multiqc/utils/config.py:121: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
    new_config = yaml.load(f)
  [WARNING]         multiqc : MultiQC Version v1.8 now available!
  [INFO   ]         multiqc : This is MultiQC v1.7
  [INFO   ]         multiqc : Template    : default
  [INFO   ]         multiqc : Searching '.'
  [INFO   ]         multiqc : Only using modules custom_content, picard, qualimap, bismark, samtools, preseq, cutadapt, fastqc
  [INFO   ]         bismark : Found 3 bismark alignment reports
  [INFO   ]         bismark : Found 3 bismark dedup reports
  [INFO   ]         bismark : Found 3 bismark methextract reports
  /opt/conda/envs/nf-core-methylseq-1.4/lib/python2.7/site-packages/multiqc/modules/custom_content/custom_content.py:84: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
    parsed_data = yaml.load(f['f'])
  [INFO   ]  custom_content : software_versions: Found 1 sample (html)
  [INFO   ]  custom_content : software_versions: Found 1135 samples (html)
  [INFO   ]  custom_content : nf-core-methylseq-summary: Found 1 sample (html)
  [INFO   ]  custom_content : nf-core-methylseq-summary: Found 2321 samples (html)
  /opt/conda/envs/nf-core-methylseq-1.4/lib/python2.7/site-packages/matplotlib/axes/_base.py:3152: UserWarning: Attempting to set identical left==right results
  in singular transformations; automatically expanding.
  left=0, right=0
    'left=%s, right=%s') % (left, right))
  [INFO   ]        qualimap : Found 3 BamQC reports
  [INFO   ]        cutadapt : Found 3 reports
  [INFO   ]          fastqc : Found 3 reports
  [INFO   ]         multiqc : Compressing plot data
  [INFO   ]         multiqc : Report      : (unreachable)/(unreachable)/multiqc_report.html
  [INFO   ]         multiqc : Data        : (unreachable)/(unreachable)/multiqc_data
  [INFO   ]         multiqc : Plots       : (unreachable)/(unreachable)/multiqc_plots
  [INFO   ]         multiqc : MultiQC complete

Work dir:
  /Users/katsman/work/11/da283d6125f8e3981293c8d2e00c18

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

During the run, I noticed that preseq didn't work well (exit code 139 or exit code 1).
The multiqc output is (unreachable)
Attached the log.

nextflow_log.txt

Please help me! That's the most basic run of this pipeline, and it's not working!

Edit: when cloning the repository from github, and running with: nextflow run methylseq/main.nf -profile test,docker -it''s working, the preseq error is ignored.

Trimming question + analysis of specific regions

Hi all, @FelixKrueger @ewels

I tried to run the pipeline with some data, but I had no idea about how they have been produced. I know just that they are methylation data. SO when fastqc run, I saw that they have bad quality at the end. I wonder if trim galore should have an option in this pipeline to cut bases by quality (5' and 3' Trimming) instead of just raw bases.

Also, I will have some methylation data for 3 specific genes in the next days. I looked into Bismark but I didn't see any option to narrow down the search windows for methylation given eg a bed file. Searching the internet I found this approach:

https://github.com/ENCODE-DCC/dna-me-pipeline

Any thoughts?

Originally posted by @kokyriakidis in #85 (comment)

Release version 1.4!

Pipeline currently requires pipeline-specific config files, should probably be replaced with nf-core/configs.

Reference genome downloaded for each sample

Let's say I start a new project with 40 samples, running methylseq with Singularity. At each iteration of bismark_align it downloads the genome files again and again and does not check if it has them already downloaded. This leads to an enormous amount of data storage needs and slow overall time. I run the pipeline with 1 sample only and with --saveReference option in order to download them and write its path on the config file.

I noticed this when I saw the network usage every time bismark_align starts. I goes max!

Also --saveReferences does not output the genome files in Results folder as supposed!

it happening or am I doing something wrong?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.