ccbr / aspen
CCBR pipeline for preliminary QC and peak calling from ATACseq datasets
Home Page: https://ccbr.github.io/ASPEN/
License: MIT License
script copies are kept in the workdir
Similar functionality was added to TOBIAS pipeline... port it here
Issue with jaccard: Genrich uses a non-canonical sort order for peak files (numeric by chromosome rather than name sorted).
This causes the jaccard step on consensus files to fail when comparing samples that have only 1 replicate with samples that have replicates: the consensus peak file is name sorted for samples with replicates, but not for a single-replicate sample. To be on the safe side, run a bedsort on the consensus BED file when nrep=1
(e.g. line 186 of ccbr_atac_genrich_peak_calling.bash: cut -f1-3 $PEAKFILE1 | sort -k1,1 -k2,2n > $CONSENSUSBEDFILE)
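For illustration, the same reordering can be sketched in Python. This is only a rough equivalent of the `cut -f1-3 | sort -k1,1 -k2,2n` command quoted above, not the pipeline's code; the function name `sort_bed3` is hypothetical, and the chromosome comparison is by plain string order (comparable to `sort` under `LC_ALL=C`).

```python
def sort_bed3(lines):
    """Keep the first three BED columns and sort by chromosome name,
    then by numeric start (ties broken by end)."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        chrom, start, end = line.split("\t")[:3]
        rows.append((chrom, int(start), int(end)))
    # Tuples compare element by element: chrom as a string, then the integers.
    rows.sort()
    return rows


if __name__ == "__main__":
    # Input in "numeric by chromosome" order, as described for Genrich output.
    peaks = ["chr1\t100\t200\n", "chr2\t5\t10\n", "chr10\t1\t4\n"]
    for chrom, start, end in sort_bed3(peaks):
        print(f"{chrom}\t{start}\t{end}")
```

With name sorting, chr10 lands before chr2, which is the order the replicate-based consensus files are described as having.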
Currently the pipeline is only capable of running paired-end reads. Have a project that is single-end. Will update pipeline to run this project.
A stat error is displayed if the runinfo.yaml file does not exist.
Not a problem... but "errors" make it look bad!
Warning message when attempting to clone this repo:
Cloning into 'ASPEN'...
remote: Enumerating objects: 1103, done.
remote: Counting objects: 100% (200/200), done.
remote: Compressing objects: 100% (131/131), done.
remote: Total 1103 (delta 71), reused 158 (delta 64), pack-reused 903
Receiving objects: 100% (1103/1103), 645.99 MiB | 3.98 MiB/s, done.
Resolving deltas: 100% (604/604), done.
Updating files: 100% (125/125), done.
Encountered 4 files that should have been pointers, but weren't:
resources/blacklistFa/hs1.blacklist.fa.gz
resources/frip/hs1.DHS.bed.gz
resources/frip/hs1.enhancers.bed.gz
resources/tssBed/hs1_tssbeds.tar.gz
This causes an error when trying to commit other unrelated changes:
An unexpected error has occurred: CalledProcessError: command: ('/Library/Developer/CommandLineTools/usr/libexec/git-core/git', '-c', 'core.autocrlf=false', 'apply', '--whitespace=nowarn', '/Users/sovacoolkl/.cache/pre-commit/patch1704842321-37084')
return code: 1
stdout: (none)
stderr:
error: the patch applies to 'resources/tssBed/hs1_tssbeds.tar.gz' (555ec58cde1546d9624e100af8985918dfcfbc3e), which does not match the current contents.
error: resources/tssBed/hs1_tssbeds.tar.gz: patch does not apply
Check the log at /Users/sovacoolkl/.cache/pre-commit/pre-commit.log
Looks like something went wrong when these files were initially committed?
The log file shows:
Activating singularity image /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/.snakemake/singularity/f1377d9e7c36f023245c48c6243ba596.simg
WARNING: While bind mounting '/data/CCBR_Pipeliner:/data/CCBR_Pipeliner': destination is already in the mount point list
+ TMPDIR=/lscratch/16386359
+ '[' '!' -d /lscratch/16386359 ']'
++ dirname /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC/degs.done
+ outdir=/vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC
+ cd /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC
+ Rscript /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/scripts/aggregate_results_runner.R --countsmatrix /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/ROI.counts.tsv --diffatacdir /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC --coldata /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC/sampleinfo.txt --foldchange 2 --fdr 0.05 --indexcols Geneid --excludecols Chr,Start,End,Strand,Length --diffatacdir /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC --tmpdir /lscratch/16386359 --scriptsdir /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/scripts
processing file: aggregate_results.Rmd
Quitting from lines 72-102 [allsamplepca] (aggregate_results.Rmd)
Error in `vst()`:
! less than 'nsub' rows,
it is recommended to use varianceStabilizingTransformation directly
Backtrace:
1. DESeq2::vst(dds1)
Execution halted
similar to RENEE's behavior.
Otherwise, users will run out of space if their SINGULARITY_CACHEDIR env var is set to the default in ~/.singularity
Ask: add countsmatrix to the workflow
Plan:
DESeq2 script needs fixing.
cluster.json is to be copied over to the workdir so it can be modified on a case-by-case basis
Use reads pooled bw files to generate heatmaps using deeptools for:
Currently:
Pipeline reads from /PIPELINEHOME/cluster.yaml file instead of /WORKDIR/cluster.yaml
Problem:
If the user wants to change params for a specific project (e.g. increase the resources for larger inputs), they can currently only do this within the pipeline home dir. Instead, this should be done at a per-project level.
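One way to get per-project behavior is to prefer a copy in the workdir and fall back to the pipeline default. The sketch below only illustrates that lookup order; `resolve_cluster_config` and its parameters are hypothetical names, not part of the pipeline, and it assumes the file is called `cluster.yaml` in both locations.

```python
from pathlib import Path


def resolve_cluster_config(workdir, pipeline_home, name="cluster.yaml"):
    """Return the per-project config if it exists, else the pipeline default.

    Hypothetical helper: the real pipeline may wire this up differently.
    """
    project_copy = Path(workdir) / name
    if project_copy.is_file():
        # A user-edited copy in the workdir takes precedence.
        return project_copy
    return Path(pipeline_home) / name
```

With this order, a user who needs more resources for one project edits only the copy in that project's workdir, and other projects keep using the defaults.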
Currently only human (hg38) is supported. Add mouse (mm10) support
I kicked off a run yesterday without a problem (N=17). I tried to do a second run with more samples (N=40) and keep getting an error during init.smk.
I've been trying to troubleshoot it, and the only thing I can figure out is that it runs fine with 39 samples, but as soon as I add the remaining sample, it errors. It doesn't matter which sample is at the end; it always errors.
Specifically, the error happens in init.smk when it's going through the reps to make sure that R1/R2 exist. I edited it to print out the rep, r1, r2 values
this works as normal for sample N=38:
rep: NHP27_TP4_CD14negCD16pos
R1: /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s788_r1.fq.gz
R2: /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s788_r2.fq.gz
but then, when it gets to sample N=39, it does this:
rep: NHP21_TP1_CD14negCD16pos
R1: replicateName
NHP21_TP1_CD14negCD16pos /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
NHP21_TP1_CD14negCD16pos /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
NHP21_TP1_CD14negCD16pos /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
Name: path_to_R1_fastq, dtype: object
R2: replicateName
NHP21_TP1_CD14negCD16pos /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
NHP21_TP1_CD14negCD16pos /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
NHP21_TP1_CD14negCD16pos /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
Name: path_to_R2_fastq, dtype: object
It seems to be repeating the entire df again. If I run with N=39 it prints as expected.
rep: NHP27_TP4_CD14negCD16pos
R1: /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s788_r1.fq.gz
R2: /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s788_r2.fq.gz
rep: NHP21_TP1_CD14negCD16pos
R1: /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s791_r1.fq.gz
R2: /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s791_r2.fq.gz
I've also printed out the replicates, and those print as expected for both
N=39
['NHP17_TP1_CD14posDRneg', 'NHP17_TP1_CD14posDRpos', 'NHP17_TP2_CD14posDRneg', 'NHP17_TP2_CD14posDRpos', 'NHP17_TP3_CD14posDRneg', 'NHP17_TP3_CD14posDRpos', 'NHP17_TP4_CD14posDRneg', 'NHP17_TP4_CD14posDRpos', 'NHP10_TP2_CD14posDRneg', 'NHP10_TP2_CD14posDRpos', 'NHP10_TP3_CD14posDRneg', 'NHP10_TP3_CD14posDRpos', 'NHP10_TP4_CD14posDRneg', 'NHP10_TP4_CD14posDRpos', 'NHP27_TP1_CD14posDRneg', 'NHP27_TP1_CD14posDRpos', 'NHP27_TP2_CD14posDRneg', 'NHP27_TP2_CD14posDRpos', 'NHP27_TP3_CD14posDRneg', 'NHP27_TP3_CD14posDRpos', 'NHP27_TP4_CD14posDRneg', 'NHP27_TP4_CD14posDRpos', 'NHP21_TP1_CD14posDRneg', 'NHP21_TP1_CD14posDRpos', 'NHP22_TP1_CD14posDRneg', 'NHP22_TP1_CD14posDRpos', 'NHP17_TP1_CD14negCD16pos', 'NHP17_TP2_CD14negCD16pos', 'NHP17_TP3_CD14negCD16pos', 'NHP17_TP4_CD14negCD16pos', 'NHP10_TP1_CD14negCD16pos', 'NHP10_TP2_CD14negCD16pos', 'NHP10_TP3_CD14negCD16pos', 'NHP10_TP4_CD14negCD16pos', 'NHP27_TP1_CD14negCD16pos', 'NHP27_TP2_CD14negCD16pos', 'NHP27_TP3_CD14negCD16pos', 'NHP27_TP4_CD14negCD16pos', 'NHP21_TP1_CD14negCD16pos']
N=40
['NHP17_TP1_CD14posDRneg', 'NHP17_TP1_CD14posDRpos', 'NHP17_TP2_CD14posDRneg', 'NHP17_TP2_CD14posDRpos', 'NHP17_TP3_CD14posDRneg', 'NHP17_TP3_CD14posDRpos', 'NHP17_TP4_CD14posDRneg', 'NHP17_TP4_CD14posDRpos', 'NHP10_TP2_CD14posDRneg', 'NHP10_TP2_CD14posDRpos', 'NHP10_TP3_CD14posDRneg', 'NHP10_TP3_CD14posDRpos', 'NHP10_TP4_CD14posDRneg', 'NHP10_TP4_CD14posDRpos', 'NHP27_TP1_CD14posDRneg', 'NHP27_TP1_CD14posDRpos', 'NHP27_TP2_CD14posDRneg', 'NHP27_TP2_CD14posDRpos', 'NHP27_TP3_CD14posDRneg', 'NHP27_TP3_CD14posDRpos', 'NHP27_TP4_CD14posDRneg', 'NHP27_TP4_CD14posDRpos', 'NHP21_TP1_CD14posDRneg', 'NHP21_TP1_CD14posDRpos', 'NHP22_TP1_CD14posDRneg', 'NHP22_TP1_CD14posDRpos', 'NHP17_TP1_CD14negCD16pos', 'NHP17_TP2_CD14negCD16pos', 'NHP17_TP3_CD14negCD16pos', 'NHP17_TP4_CD14negCD16pos', 'NHP10_TP1_CD14negCD16pos', 'NHP10_TP2_CD14negCD16pos', 'NHP10_TP3_CD14negCD16pos', 'NHP10_TP4_CD14negCD16pos', 'NHP27_TP1_CD14negCD16pos', 'NHP27_TP2_CD14negCD16pos', 'NHP27_TP3_CD14negCD16pos', 'NHP27_TP4_CD14negCD16pos', 'NHP21_TP1_CD14negCD16pos', 'NHP21_TP1_CD14negCD16pos']
Finally, I have tried removing extra lines and carriage returns, and that hasn't fixed it either. I don't get where the issue is coming from.
Attaching samples file as example.
samples.txt
dryrun error
TypeError in line 39 of /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/ASAP/v0.5.3/workflow/rules/init.smk:
stat: path should be string, bytes, os.PathLike or integer, not Series
Using ASAP found here
/data/CCBR_Pipeliner/Pipelines/ASAP/v0.5.3
Need an extra pair of eyes, @kopardev!
Add code to check samples.tsv to ensure that replicate names are unique. If names are not unique, an error will occur in init.smk during the FQ check (see issue #8)
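A minimal sketch of such a check, using only the standard library. It assumes samples.tsv is tab-delimited with a header column named `replicateName` (the name that appears in the debug output in issue #8); the function name `find_duplicate_replicates` is hypothetical and this is not the pipeline's actual code.

```python
import csv
import io
from collections import Counter


def find_duplicate_replicates(handle, column="replicateName"):
    """Return the sorted replicate names that occur more than once.

    `handle` is any open text file or file-like object holding the sample sheet.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(row[column].strip() for row in reader)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Toy sample sheet with made-up paths, for illustration only.
    sheet = io.StringIO(
        "replicateName\tpath_to_R1_fastq\tpath_to_R2_fastq\n"
        "repA\t/tmp/a_r1.fq.gz\t/tmp/a_r2.fq.gz\n"
        "repB\t/tmp/b_r1.fq.gz\t/tmp/b_r2.fq.gz\n"
        "repB\t/tmp/c_r1.fq.gz\t/tmp/c_r2.fq.gz\n"
    )
    dups = find_duplicate_replicates(sheet)
    if dups:
        print("Duplicate replicate names:", ", ".join(dups))
```

Failing early with the list of offending names would give a clearer message than the `TypeError` from `stat`, which appears to arise because a lookup by a duplicated replicate name returns several rows instead of one path.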
Summary
DGE0_CD3_Wk4_BCG
Project location
/data/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20
Rule code for atac_fld
cat /data/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/logs/54331095.54336662.atac_fld.replicate=DGE0_CD3_Wk4_BCG.err
python /gpfs/gsfs10/users/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/scripts/ccbr_atac_bam2FLD.py -i /gpfs/gsfs10/users/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/results/dedupBam/DGE0_CD3_Wk4_BCG.dedup.bam -o /gpfs/gsfs10/users/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/results/QC/fld/DGE0_CD3_Wk4_BCG.fld.txt
Review input; no reads found
samtools view /gpfs/gsfs10/users/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/results/dedupBam/DGP4_CD3_Wk4_BCG.dedup.bam
Review QC for bam file
cat DGE0_CD3_Wk4_BCG.bowtie2.bam.flagstat
67647231 + 0 in total (QC-passed reads + QC-failed reads)
16561337 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
66637105 + 0 mapped (98.51% : N/A)
51085894 + 0 paired in sequencing
25542947 + 0 read1
25542947 + 0 read2
48207136 + 0 properly paired (94.36% : N/A)
49686192 + 0 with itself and mate mapped
389576 + 0 singletons (0.76% : N/A)
92274 + 0 with mate mapped to a different chr
38075 + 0 with mate mapped to a different chr (mapQ>=5)
Currently not all files within this path have read access, and therefore the pipeline cannot run:
/data/CCBR_Pipeliner/Pipelines/ASAP/v0.6.1
@kopardev - please update permissions. I've created a tmp dir in the meantime which has read access (and write access) that can be deleted once this is changed.
Thanks!
Running jaccard on a recent mmul dataset has led to samples timing out. The current resource allocation should be adjusted.
Current resources
"jaccard": {
"mem": "48g",
"threads": "2",
"time": "01-00:00:00"
},
Proposed changes
"jaccard": {
"mem": "40g",
"threads": "32",
"time": "02-00:00:00"
},
Memory and time can probably be dropped with the increase in threads; will follow up after re-running current samples through with these changes.