ccbr / aspen

CCBR pipeline for preliminary QC and peak calling from ATACseq datasets

Home Page: https://ccbr.github.io/ASPEN/

License: MIT License

Shell 41.79% Python 40.40% R 17.81%

atac-seq-pipeline ngs-pipeline macs2 genrich multiqc

aspen's Issues

"consensus" peaks are generated even when samples have only 1 replicate

The consensus peak set should be the same as the single replicate's peak calls.
No need to redo everything downstream.

Add dynamic partition based on buyinnodes

Similar functionality was added to TOBIAS pipeline... port it here

jaccard issue

Issue with jaccard: Genrich uses a non-canonical sort order for peak files (i.e. not name-sorted, but numeric by chromosome).

This causes the jaccard step on consensus files to fail when comparing samples with only 1 replicate to those with replicates: the consensus peak file is name-sorted for samples with replicates, but not for a single replicate. To be on the safe side, do a bedsort on the consensus bedfile in cases where nrep=1

(e.g. line 186 of ccbr_atac_genrich_peak_calling.bash cut -f1-3 $PEAKFILE1 | sort -k1,1 -k2,2n > $CONSENSUSBEDFILE)
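The sort suggested above can be sketched in Python as follows. This is only an illustration, not the pipeline's code: the function name `sort_bed3` is hypothetical, and it assumes a tab-delimited peak file whose first three columns are chromosome, start and end. Chromosomes are compared as plain strings, which should correspond to `sort -k1,1 -k2,2n` under the C locale.

```python
def sort_bed3(lines):
    """Return BED3 records sorted by chromosome name, then numeric start.

    Hypothetical helper: keeps only the first three columns, like `cut -f1-3`.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        chrom, start, end = line.split("\t")[:3]
        records.append((chrom, int(start), int(end)))
    # Name sort on chromosome (string comparison), numeric sort on start.
    records.sort(key=lambda rec: (rec[0], rec[1]))
    return ["{}\t{}\t{}".format(*rec) for rec in records]


# Made-up example records in a "numeric by chromosome" order.
peaks = [
    "chr1\t700\t800\tpeak_c",
    "chr1\t50\t120\tpeak_d",
    "chr2\t500\t900\tpeak_a",
    "chr10\t100\t300\tpeak_b",
]
sorted_peaks = sort_bed3(peaks)
```

With a name sort, `chr10` lands before `chr2`, which is the order the replicated samples' consensus files already have.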

add contrast comparisons between peaks

updating config to take in contrast.tsv
create rules to run deeptools on samples (correlation matrix, heatmaps, profiles, summary)
create rules to create counts matrix
create rules to perform DESEQ between contrasts - output a diffresults.txt
create rules to perform conversion of diffresults to bed file

Need to add single-end option

Problem

Currently the pipeline can only run paired-end reads. I have a project that is single-end and will update the pipeline to run it.

`runinfo.yaml` stat error

stat error is displayed if runinfo.yaml file does not exist
Not a problem... but "errors" make it look bad!

warning during `git clone`: large files should be handled with `git lfs`

Warning message when attempting to clone this repo:

Cloning into 'ASPEN'...
remote: Enumerating objects: 1103, done.
remote: Counting objects: 100% (200/200), done.
remote: Compressing objects: 100% (131/131), done.
remote: Total 1103 (delta 71), reused 158 (delta 64), pack-reused 903
Receiving objects: 100% (1103/1103), 645.99 MiB | 3.98 MiB/s, done.
Resolving deltas: 100% (604/604), done.
Updating files: 100% (125/125), done.
Encountered 4 files that should have been pointers, but weren't:
        resources/blacklistFa/hs1.blacklist.fa.gz
        resources/frip/hs1.DHS.bed.gz
        resources/frip/hs1.enhancers.bed.gz
        resources/tssBed/hs1_tssbeds.tar.gz

This causes an error when trying to commit other unrelated changes:

An unexpected error has occurred: CalledProcessError: command: ('/Library/Developer/CommandLineTools/usr/libexec/git-core/git', '-c', 'core.autocrlf=false', 'apply', '--whitespace=nowarn', '/Users/sovacoolkl/.cache/pre-commit/patch1704842321-37084')
return code: 1
stdout: (none)
stderr:
    error: the patch applies to 'resources/tssBed/hs1_tssbeds.tar.gz' (555ec58cde1546d9624e100af8985918dfcfbc3e), which does not match the current contents.
    error: resources/tssBed/hs1_tssbeds.tar.gz: patch does not apply
Check the log at /Users/sovacoolkl/.cache/pre-commit/pre-commit.log

Looks like something went wrong when these files were initially committed?

Add hs1 or T2T support

create resource files
run test case

Contrast file is "required"

print message that no contrast will be run if no file provided
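One way to make the contrast file optional is sketched below. The function name `load_contrasts` and the two-column layout (one pair of group names per line) are assumptions for illustration; the actual format of the pipeline's contrast file is not described here.

```python
import os


def load_contrasts(contrasts_path):
    """Return a list of (group1, group2) pairs, or [] when no file is given.

    Hypothetical helper; assumes one whitespace-separated pair per line.
    """
    if not contrasts_path or not os.path.isfile(contrasts_path):
        # Inform the user instead of failing.
        print("No contrast file provided; no contrasts will be run.")
        return []
    contrasts = []
    with open(contrasts_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                contrasts.append((fields[0], fields[1]))
    return contrasts


# No file supplied: prints the message and returns an empty list.
no_contrasts = load_contrasts(None)
```

Downstream differential rules could then be skipped whenever the returned list is empty.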

aggregate error in DiffATAC reported by Krithika

The log file shows:

Activating singularity image /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/.snakemake/singularity/f1377d9e7c36f023245c48c6243ba596.simg
WARNING: While bind mounting '/data/CCBR_Pipeliner:/data/CCBR_Pipeliner': destination is already in the mount point list
+ TMPDIR=/lscratch/16386359
+ '[' '!' -d /lscratch/16386359 ']'
++ dirname /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC/degs.done
+ outdir=/vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC
+ cd /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC
+ Rscript /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/scripts/aggregate_results_runner.R --countsmatrix /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/ROI.counts.tsv --diffatacdir /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC --coldata /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC/sampleinfo.txt --foldchange 2 --fdr 0.05 --indexcols Geneid --excludecols Chr,Start,End,Strand,Length --diffatacdir /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC --tmpdir /lscratch/16386359 --scriptsdir /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/scripts


processing file: aggregate_results.Rmd

Quitting from lines 72-102 [allsamplepca] (aggregate_results.Rmd)
Error in `vst()`:
! less than 'nsub' rows,
  it is recommended to use varianceStabilizingTransformation directly
Backtrace:
 1. DESeq2::vst(dds1)
Execution halted

set default singularity cache directory to `$WORKDIR/.singularity`

similar to RENEE's behavior.
Otherwise, users will run out of space if their SINGULARITY_CACHEDIR env var is set to the default in ~/.singularity
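A minimal sketch of the requested default follows. The function name `resolve_singularity_cachedir` and the `explicit` override are hypothetical; the sketch also assumes that an inherited `SINGULARITY_CACHEDIR` value is deliberately ignored unless the user passes a directory explicitly, since the inherited value is what fills up the home directory.

```python
import os


def resolve_singularity_cachedir(workdir, explicit=None):
    """Pick the Singularity cache directory for a run.

    Hypothetical helper: an explicitly requested directory wins,
    otherwise the cache lives in <workdir>/.singularity.
    """
    cachedir = explicit if explicit else os.path.join(workdir, ".singularity")
    return os.path.abspath(cachedir)


# Build the environment that would be handed to the workflow run.
env = dict(os.environ)
env["SINGULARITY_CACHEDIR"] = resolve_singularity_cachedir("/tmp/run1")
```

Passing a modified copy of the environment, rather than changing `os.environ` in place, keeps the caller's own shell settings untouched.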

add Counts Matrix to workflow

Ask: add countsmatrix to the workflow

Plan:

take output from atac_genrich_peakcalling_fixed_width :
renormalizedConsensusNarrowPeak=join(RESULTSDIR,"peaks","genrich","fixed_width","{sample}.renormalized.fixed_width.consensus.narrowPeak")
run python script
python {params.pyscript} --bedbedgraph {input.inputs} --tmpdir $TMPDIR --countsmatrix {output.cm} --fragmentscountsmatrix {output.fcm} --sampleinfo {output.si}

Diff error

DESeq2 script needs fixing.

Add code for fixed_width peak calling

fixed width peaks
fixed width consensus peaks

copy cluster.json to WORKDIR

cluster.json is to be copied over to the workdir so it can be modified on a case-by-case basis

What to do if no MANIFEST

use the default manifest from PIPELINES_HOME

ASAP: Deeptools peaks

Use reads pooled bw files to generate heatmaps using deeptools for:

around TSS
metagene

Change location of where the pipeline reads cluster.yaml

Currently:
Pipeline reads from /PIPELINEHOME/cluster.yaml file instead of /WORKDIR/cluster.yaml

Problem:
If the user wants to change params for a specific project (e.g. increase the resources for larger inputs), they can currently only do this within the pipeline home dir. Instead, this should be done at a per-project level.
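The per-project lookup could look roughly like this. The function name `find_cluster_config` is hypothetical, and falling back to the pipeline home copy when the workdir has none is an assumption, not confirmed pipeline behavior.

```python
import os


def find_cluster_config(workdir, pipeline_home, filename="cluster.yaml"):
    """Return the cluster config path, preferring the per-project copy.

    Hypothetical helper: checks workdir first, then pipeline_home.
    """
    for base in (workdir, pipeline_home):
        candidate = os.path.join(base, filename)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(
        "{} not found in {} or {}".format(filename, workdir, pipeline_home)
    )
```

A user who needs more resources for one project would then edit only the copy inside that project's workdir.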

Add mouse support for Diff ATAC

Currently only human (hg38) is supported. Add mouse (mm10) support

Check if jobby running correctly.

Spooker report has no ASPEN data points. Investigate why.

sample error with samples > 39

I kicked off a run yesterday without a problem (N=17). I tried to do a second run with more samples (N=40) and keep getting an error during init.smk.

I've been trying to troubleshoot it, and the only thing I can figure out is that if I run it with 39 samples it runs fine, but as soon as I add the remaining sample, it errors. It doesn't matter which sample is at the end; it always errors.

Specifically, the error is happening during init.smk, when it's going through the reps to make sure that R1/R2 exist. I edited it to print out the rep, r1, r2 values.

this works as normal for sample N=38:

rep:  NHP27_TP4_CD14negCD16pos
R1:  /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s788_r1.fq.gz
R2:  /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s788_r2.fq.gz

but then, when it gets to sample N=39, it does this:

rep:  NHP21_TP1_CD14negCD16pos
R1:  replicateName
NHP21_TP1_CD14negCD16pos    /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
NHP21_TP1_CD14negCD16pos    /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
NHP21_TP1_CD14negCD16pos    /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
Name: path_to_R1_fastq, dtype: object
R2:  replicateName
NHP21_TP1_CD14negCD16pos    /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
NHP21_TP1_CD14negCD16pos    /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
NHP21_TP1_CD14negCD16pos    /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
Name: path_to_R2_fastq, dtype: object

It seems to be repeating the entire df again. If I run with N=39 it prints as expected.

rep:  NHP27_TP4_CD14negCD16pos
R1:  /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s788_r1.fq.gz
R2:  /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s788_r2.fq.gz
rep:  NHP21_TP1_CD14negCD16pos
R1:  /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s791_r1.fq.gz
R2:  /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s791_r2.fq.gz

I've also printed out the replicates, and those print as expected for both
N=39

['NHP17_TP1_CD14posDRneg', 'NHP17_TP1_CD14posDRpos', 'NHP17_TP2_CD14posDRneg', 'NHP17_TP2_CD14posDRpos', 'NHP17_TP3_CD14posDRneg', 'NHP17_TP3_CD14posDRpos', 'NHP17_TP4_CD14posDRneg', 'NHP17_TP4_CD14posDRpos', 'NHP10_TP2_CD14posDRneg', 'NHP10_TP2_CD14posDRpos', 'NHP10_TP3_CD14posDRneg', 'NHP10_TP3_CD14posDRpos', 'NHP10_TP4_CD14posDRneg', 'NHP10_TP4_CD14posDRpos', 'NHP27_TP1_CD14posDRneg', 'NHP27_TP1_CD14posDRpos', 'NHP27_TP2_CD14posDRneg', 'NHP27_TP2_CD14posDRpos', 'NHP27_TP3_CD14posDRneg', 'NHP27_TP3_CD14posDRpos', 'NHP27_TP4_CD14posDRneg', 'NHP27_TP4_CD14posDRpos', 'NHP21_TP1_CD14posDRneg', 'NHP21_TP1_CD14posDRpos', 'NHP22_TP1_CD14posDRneg', 'NHP22_TP1_CD14posDRpos', 'NHP17_TP1_CD14negCD16pos', 'NHP17_TP2_CD14negCD16pos', 'NHP17_TP3_CD14negCD16pos', 'NHP17_TP4_CD14negCD16pos', 'NHP10_TP1_CD14negCD16pos', 'NHP10_TP2_CD14negCD16pos', 'NHP10_TP3_CD14negCD16pos', 'NHP10_TP4_CD14negCD16pos', 'NHP27_TP1_CD14negCD16pos', 'NHP27_TP2_CD14negCD16pos', 'NHP27_TP3_CD14negCD16pos', 'NHP27_TP4_CD14negCD16pos', 'NHP21_TP1_CD14negCD16pos']

N=40

['NHP17_TP1_CD14posDRneg', 'NHP17_TP1_CD14posDRpos', 'NHP17_TP2_CD14posDRneg', 'NHP17_TP2_CD14posDRpos', 'NHP17_TP3_CD14posDRneg', 'NHP17_TP3_CD14posDRpos', 'NHP17_TP4_CD14posDRneg', 'NHP17_TP4_CD14posDRpos', 'NHP10_TP2_CD14posDRneg', 'NHP10_TP2_CD14posDRpos', 'NHP10_TP3_CD14posDRneg', 'NHP10_TP3_CD14posDRpos', 'NHP10_TP4_CD14posDRneg', 'NHP10_TP4_CD14posDRpos', 'NHP27_TP1_CD14posDRneg', 'NHP27_TP1_CD14posDRpos', 'NHP27_TP2_CD14posDRneg', 'NHP27_TP2_CD14posDRpos', 'NHP27_TP3_CD14posDRneg', 'NHP27_TP3_CD14posDRpos', 'NHP27_TP4_CD14posDRneg', 'NHP27_TP4_CD14posDRpos', 'NHP21_TP1_CD14posDRneg', 'NHP21_TP1_CD14posDRpos', 'NHP22_TP1_CD14posDRneg', 'NHP22_TP1_CD14posDRpos', 'NHP17_TP1_CD14negCD16pos', 'NHP17_TP2_CD14negCD16pos', 'NHP17_TP3_CD14negCD16pos', 'NHP17_TP4_CD14negCD16pos', 'NHP10_TP1_CD14negCD16pos', 'NHP10_TP2_CD14negCD16pos', 'NHP10_TP3_CD14negCD16pos', 'NHP10_TP4_CD14negCD16pos', 'NHP27_TP1_CD14negCD16pos', 'NHP27_TP2_CD14negCD16pos', 'NHP27_TP3_CD14negCD16pos', 'NHP27_TP4_CD14negCD16pos', 'NHP21_TP1_CD14negCD16pos', 'NHP21_TP1_CD14negCD16pos']

Finally, I have tried removing extra lines and carriage returns, and that hasn't fixed it either. I don't get where the issue is coming from.

Attaching samples file as example.
samples.txt

dryrun error

TypeError in line 39 of /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/ASAP/v0.5.3/workflow/rules/init.smk:
stat: path should be string, bytes, os.PathLike or integer, not Series

Using ASAP found here

/data/CCBR_Pipeliner/Pipelines/ASAP/v0.5.3

Need an extra pair of eyes, @kopardev!

update citation file with Zenodo DOI after we cut the next release

Check replicate names are unique

Add code to check samples.tsv to ensure that replicate names are unique. If names are not unique, an error will occur at init.smk during the FQ check (see issue #8).
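The check could be sketched with the standard library alone, as below. The function name `find_duplicate_replicates` is hypothetical, and the sketch assumes a tab-delimited manifest with a header row containing a `replicateName` column (the name that appears in the printed output in the related issue); the example rows are made up.

```python
import csv
import io
from collections import Counter


def find_duplicate_replicates(handle, column="replicateName"):
    """Return the sorted replicate names that occur more than once.

    Hypothetical helper; expects a tab-delimited file with a header row.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(row[column] for row in reader)
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up manifest in which "repB" is listed twice.
manifest = io.StringIO(
    "replicateName\tpath_to_R1_fastq\tpath_to_R2_fastq\n"
    "repA\t/path/a_r1.fq.gz\t/path/a_r2.fq.gz\n"
    "repB\t/path/b_r1.fq.gz\t/path/b_r2.fq.gz\n"
    "repB\t/path/c_r1.fq.gz\t/path/c_r2.fq.gz\n"
)
duplicates = find_duplicate_replicates(manifest)
if duplicates:
    print("Replicate names are not unique: " + ", ".join(duplicates))
```

Failing early with the offending names is likely easier to act on than the later type error about a Series being passed where a path is expected.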

tmp

Summary

Project failure at three rules for all samples.
Choose one sample to review issue: DGE0_CD3_Wk4_BCG.
Rules failing:
- Error in rule atac_fld:
- Error in rule atac_genrich_peakcalling:
- Error in rule atac_macs_peakcalling:

Project location

/data/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20

Rule code for atac_fld

cat /data/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/logs/54331095.54336662.atac_fld.replicate=DGE0_CD3_Wk4_BCG.err

python /gpfs/gsfs10/users/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/scripts/ccbr_atac_bam2FLD.py -i /gpfs/gsfs10/users/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/results/dedupBam/DGE0_CD3_Wk4_BCG.dedup.bam -o /gpfs/gsfs10/users/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/results/QC/fld/DGE0_CD3_Wk4_BCG.fld.txt

Review input; no reads found

samtools view /gpfs/gsfs10/users/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/results/dedupBam/DGP4_CD3_Wk4_BCG.dedup.bam

Review QC for bam file

cat DGE0_CD3_Wk4_BCG.bowtie2.bam.flagstat
67647231 + 0 in total (QC-passed reads + QC-failed reads)
16561337 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
66637105 + 0 mapped (98.51% : N/A)
51085894 + 0 paired in sequencing
25542947 + 0 read1
25542947 + 0 read2
48207136 + 0 properly paired (94.36% : N/A)
49686192 + 0 with itself and mate mapped
389576 + 0 singletons (0.76% : N/A)
92274 + 0 with mate mapped to a different chr
38075 + 0 with mate mapped to a different chr (mapQ>=5)

read access not given for v.0.6.1

Currently not all files within this path have read access, and therefore the pipeline cannot run:

/data/CCBR_Pipeliner/Pipelines/ASAP/v0.6.1

@kopardev - please update permissions. I've created a tmp dir in the meantime which has read access (and write access) that can be deleted once this is changed.

Thanks!

jaccard time out; max out on CPUS

Running jaccard on a recent mmul dataset has led to samples timing out. The current resource allocation should be adjusted.
Current resources

        "jaccard": {
                "mem": "48g",
                "threads": "2",
                "time": "01-00:00:00"
        },

Proposed changes

        "jaccard": {
                "mem": "40g",
                "threads": "32",
                "time": "02-00:00:00"
        },

Memory and time can probably be dropped with the increase in threads; will follow up after re-running current samples through with these changes.