ccbr / aspen

CCBR pipeline for preliminary QC and peak calling from ATACseq datasets

Home Page: https://ccbr.github.io/ASPEN/

License: MIT License

Shell 41.79% Python 40.40% R 17.81%
atac-seq-pipeline ngs-pipeline macs2 genrich multiqc

aspen's People

Contributors

kelly-sovacool, kopardev, slsevilla

aspen's Issues

jaccard issue

Issue with jaccard: Genrich uses a non-canonical sort order for peak files (chromosomes are sorted numerically rather than by name).

This causes the jaccard step on consensus files to fail when comparing samples that have only one replicate with samples that have several. The consensus peak file is name sorted for samples with replicates, but not for single-replicate samples. To be on the safe side, run a bedsort on the consensus BED file whenever nrep=1

(e.g. line 186 of ccbr_atac_genrich_peak_calling.bash: cut -f1-3 $PEAKFILE1 | sort -k1,1 -k2,2n > $CONSENSUSBEDFILE)
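
The same ordering can be sketched in Python, for cases where the consensus file is written from a script rather than a shell pipeline. This is a minimal sketch and not the pipeline's code: `sort_bed_lines` is a hypothetical helper, the peaks are made up, and it assumes tab-separated BED input with an integer start in column 2. Chromosome names are compared as plain strings (similar to `sort` under `LC_ALL=C`), so `chr10` sorts before `chr2`.

```python
def sort_bed_lines(lines):
    """Return the first three BED columns, sorted by chromosome name then start.

    Hypothetical helper mirroring `cut -f1-3 | sort -k1,1 -k2,2n`.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        chrom, start, end = line.split("\t")[:3]
        records.append((chrom, int(start), end))
    # Chromosome is compared as a string, start as an integer.
    records.sort(key=lambda rec: (rec[0], rec[1]))
    return ["\t".join((chrom, str(start), end)) for chrom, start, end in records]


# Made-up peaks in numeric chromosome order, as a single-replicate file might be.
example = [
    "chr2\t500\t900\tpeak_a",
    "chr10\t100\t300\tpeak_b",
    "chr2\t50\t200\tpeak_c",
]
print("\n".join(sort_bed_lines(example)))
```

Applying one ordering to every consensus file, whatever the replicate count, keeps the inputs to the jaccard comparison consistent.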

add contrast comparisons between peaks

  • update config to take in contrast.tsv
  • create rules to run deeptools on samples (correlation matrix, heatmaps, profiles, summary)
  • create rules to create the counts matrix
  • create rules to run DESeq2 between contrasts and output a diffresults.txt
  • create rules to convert diffresults to a BED file

Need to add single-end option

Problem

Currently the pipeline can only process paired-end reads. I have a project that is single-end, so I will update the pipeline to support it.

`runinfo.yaml` stat error

A stat error is displayed if the runinfo.yaml file does not exist.
Not a problem, but the "errors" make it look bad!
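
One way to avoid the message is to test for the file before inspecting it. The sketch below is an assumption about the shape of such a guard, not the pipeline's code: `runinfo_mtime` is a hypothetical helper, and it assumes the stat call only needs the file's modification time.

```python
import os
import tempfile


def runinfo_mtime(workdir):
    """Return the modification time of runinfo.yaml, or None if it is absent.

    Hypothetical helper: checking first means a missing file on a first run
    is handled quietly instead of surfacing a stat error.
    """
    path = os.path.join(workdir, "runinfo.yaml")
    if not os.path.isfile(path):
        return None
    return os.stat(path).st_mtime


with tempfile.TemporaryDirectory() as workdir:
    print(runinfo_mtime(workdir))  # no runinfo.yaml yet
    with open(os.path.join(workdir, "runinfo.yaml"), "w") as handle:
        handle.write("runid: example\n")  # made-up content
    print(runinfo_mtime(workdir) is not None)
```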

warning during `git clone`: large files should be handled with `git lfs`

Warning message when attempting to clone this repo:

Cloning into 'ASPEN'...
remote: Enumerating objects: 1103, done.
remote: Counting objects: 100% (200/200), done.
remote: Compressing objects: 100% (131/131), done.
remote: Total 1103 (delta 71), reused 158 (delta 64), pack-reused 903
Receiving objects: 100% (1103/1103), 645.99 MiB | 3.98 MiB/s, done.
Resolving deltas: 100% (604/604), done.
Updating files: 100% (125/125), done.
Encountered 4 files that should have been pointers, but weren't:
        resources/blacklistFa/hs1.blacklist.fa.gz
        resources/frip/hs1.DHS.bed.gz
        resources/frip/hs1.enhancers.bed.gz
        resources/tssBed/hs1_tssbeds.tar.gz

This causes an error when trying to commit other unrelated changes:

An unexpected error has occurred: CalledProcessError: command: ('/Library/Developer/CommandLineTools/usr/libexec/git-core/git', '-c', 'core.autocrlf=false', 'apply', '--whitespace=nowarn', '/Users/sovacoolkl/.cache/pre-commit/patch1704842321-37084')
return code: 1
stdout: (none)
stderr:
    error: the patch applies to 'resources/tssBed/hs1_tssbeds.tar.gz' (555ec58cde1546d9624e100af8985918dfcfbc3e), which does not match the current contents.
    error: resources/tssBed/hs1_tssbeds.tar.gz: patch does not apply
Check the log at /Users/sovacoolkl/.cache/pre-commit/pre-commit.log

Looks like something went wrong when these files were initially committed?

aggregate error in DiffATAC reported by Krithika

The log file shows:

Activating singularity image /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/.snakemake/singularity/f1377d9e7c36f023245c48c6243ba596.simg
WARNING: While bind mounting '/data/CCBR_Pipeliner:/data/CCBR_Pipeliner': destination is already in the mount point list
+ TMPDIR=/lscratch/16386359
+ '[' '!' -d /lscratch/16386359 ']'
++ dirname /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC/degs.done
+ outdir=/vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC
+ cd /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC
+ Rscript /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/scripts/aggregate_results_runner.R --countsmatrix /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/ROI.counts.tsv --diffatacdir /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC --coldata /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC/sampleinfo.txt --foldchange 2 --fdr 0.05 --indexcols Geneid --excludecols Chr,Start,End,Strand,Length --diffatacdir /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/results/peaks/genrich/DiffATAC --tmpdir /lscratch/16386359 --scriptsdir /vf/users/CCRCCDI/analysis/ccrtegs4/atac_test/test2/scripts


processing file: aggregate_results.Rmd

Quitting from lines 72-102 [allsamplepca] (aggregate_results.Rmd)
Error in `vst()`:
! less than 'nsub' rows,
  it is recommended to use varianceStabilizingTransformation directly
Backtrace:
 1. DESeq2::vst(dds1)
Execution halted

add Counts Matrix to workflow

Ask: add countsmatrix to the workflow

Plan:

  • take output from atac_genrich_peakcalling_fixed_width :
    renormalizedConsensusNarrowPeak=join(RESULTSDIR,"peaks","genrich","fixed_width","{sample}.renormalized.fixed_width.consensus.narrowPeak")
  • run python script
    python {params.pyscript} --bedbedgraph {input.inputs} --tmpdir $TMPDIR --countsmatrix {output.cm} --fragmentscountsmatrix {output.fcm} --sampleinfo {output.si}

ASAP: Deeptools peaks

Use reads pooled bw files to generate heatmaps using deeptools for:

  • around TSS
  • metagene

Change location of where the pipeline reads cluster.yaml

Currently:
The pipeline reads from the /PIPELINEHOME/cluster.yaml file instead of /WORKDIR/cluster.yaml

Problem:
If the user wants to change params for a specific project (e.g. increase the resources for larger inputs), they can currently only do this within the pipeline home dir. Instead, this should be done at a per-project level.
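
A possible lookup order is sketched below: use the project's copy when it exists and fall back to the pipeline default otherwise. This is only an illustration under assumptions: `resolve_cluster_yaml` is a hypothetical helper, and its two arguments stand in for WORKDIR and PIPELINEHOME.

```python
import os
import tempfile


def resolve_cluster_yaml(workdir, pipeline_home):
    """Prefer a per-project cluster.yaml and fall back to the pipeline default.

    Hypothetical helper; the arguments stand in for WORKDIR and PIPELINEHOME.
    """
    project_copy = os.path.join(workdir, "cluster.yaml")
    if os.path.isfile(project_copy):
        return project_copy
    return os.path.join(pipeline_home, "cluster.yaml")


with tempfile.TemporaryDirectory() as home, tempfile.TemporaryDirectory() as work:
    open(os.path.join(home, "cluster.yaml"), "w").close()
    # No project-level file yet, so the pipeline default is selected.
    print(resolve_cluster_yaml(work, home) == os.path.join(home, "cluster.yaml"))
    open(os.path.join(work, "cluster.yaml"), "w").close()
    # A project-level file now takes precedence.
    print(resolve_cluster_yaml(work, home) == os.path.join(work, "cluster.yaml"))
```

Copying the default file into the working directory at initialization would be another option; the fallback shown here simply keeps existing projects working unchanged.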

sample error with samples > 39

I kicked off a run yesterday without a problem (N=17). I tried a second run with more samples (N=40) and keep getting an error during init.smk.

I've been trying to troubleshoot it, and the only thing I can figure out is that it runs fine with 39 samples, but as soon as I add the remaining sample, it errors. It doesn't matter which sample is at the end; it always errors.

Specifically, the error happens during init.smk, when it's going through the reps to make sure that R1/R2 exist. I edited it to print out the rep, R1, and R2 values.

This works as normal for sample N=38:

rep:  NHP27_TP4_CD14negCD16pos
R1:  /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s788_r1.fq.gz
R2:  /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s788_r2.fq.gz

But then, when it gets to sample N=39, it does this:

rep:  NHP21_TP1_CD14negCD16pos
R1:  replicateName
NHP21_TP1_CD14negCD16pos    /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
NHP21_TP1_CD14negCD16pos    /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
NHP21_TP1_CD14negCD16pos    /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
Name: path_to_R1_fastq, dtype: object
R2:  replicateName
NHP21_TP1_CD14negCD16pos    /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
NHP21_TP1_CD14negCD16pos    /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
NHP21_TP1_CD14negCD16pos    /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s...
Name: path_to_R2_fastq, dtype: object

It seems to be repeating the entire df again. If I run with N=39 it prints as expected.

rep:  NHP27_TP4_CD14negCD16pos
R1:  /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s788_r1.fq.gz
R2:  /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s788_r2.fq.gz
rep:  NHP21_TP1_CD14negCD16pos
R1:  /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s791_r1.fq.gz
R2:  /data/NCI_VB/rawdata/CCRVB-13/190321_0539/fq/s791_r2.fq.gz

I've also printed out the replicates, and those print as expected for both
N=39

['NHP17_TP1_CD14posDRneg', 'NHP17_TP1_CD14posDRpos', 'NHP17_TP2_CD14posDRneg', 'NHP17_TP2_CD14posDRpos', 'NHP17_TP3_CD14posDRneg', 'NHP17_TP3_CD14posDRpos', 'NHP17_TP4_CD14posDRneg', 'NHP17_TP4_CD14posDRpos', 'NHP10_TP2_CD14posDRneg', 'NHP10_TP2_CD14posDRpos', 'NHP10_TP3_CD14posDRneg', 'NHP10_TP3_CD14posDRpos', 'NHP10_TP4_CD14posDRneg', 'NHP10_TP4_CD14posDRpos', 'NHP27_TP1_CD14posDRneg', 'NHP27_TP1_CD14posDRpos', 'NHP27_TP2_CD14posDRneg', 'NHP27_TP2_CD14posDRpos', 'NHP27_TP3_CD14posDRneg', 'NHP27_TP3_CD14posDRpos', 'NHP27_TP4_CD14posDRneg', 'NHP27_TP4_CD14posDRpos', 'NHP21_TP1_CD14posDRneg', 'NHP21_TP1_CD14posDRpos', 'NHP22_TP1_CD14posDRneg', 'NHP22_TP1_CD14posDRpos', 'NHP17_TP1_CD14negCD16pos', 'NHP17_TP2_CD14negCD16pos', 'NHP17_TP3_CD14negCD16pos', 'NHP17_TP4_CD14negCD16pos', 'NHP10_TP1_CD14negCD16pos', 'NHP10_TP2_CD14negCD16pos', 'NHP10_TP3_CD14negCD16pos', 'NHP10_TP4_CD14negCD16pos', 'NHP27_TP1_CD14negCD16pos', 'NHP27_TP2_CD14negCD16pos', 'NHP27_TP3_CD14negCD16pos', 'NHP27_TP4_CD14negCD16pos', 'NHP21_TP1_CD14negCD16pos']

N=40

['NHP17_TP1_CD14posDRneg', 'NHP17_TP1_CD14posDRpos', 'NHP17_TP2_CD14posDRneg', 'NHP17_TP2_CD14posDRpos', 'NHP17_TP3_CD14posDRneg', 'NHP17_TP3_CD14posDRpos', 'NHP17_TP4_CD14posDRneg', 'NHP17_TP4_CD14posDRpos', 'NHP10_TP2_CD14posDRneg', 'NHP10_TP2_CD14posDRpos', 'NHP10_TP3_CD14posDRneg', 'NHP10_TP3_CD14posDRpos', 'NHP10_TP4_CD14posDRneg', 'NHP10_TP4_CD14posDRpos', 'NHP27_TP1_CD14posDRneg', 'NHP27_TP1_CD14posDRpos', 'NHP27_TP2_CD14posDRneg', 'NHP27_TP2_CD14posDRpos', 'NHP27_TP3_CD14posDRneg', 'NHP27_TP3_CD14posDRpos', 'NHP27_TP4_CD14posDRneg', 'NHP27_TP4_CD14posDRpos', 'NHP21_TP1_CD14posDRneg', 'NHP21_TP1_CD14posDRpos', 'NHP22_TP1_CD14posDRneg', 'NHP22_TP1_CD14posDRpos', 'NHP17_TP1_CD14negCD16pos', 'NHP17_TP2_CD14negCD16pos', 'NHP17_TP3_CD14negCD16pos', 'NHP17_TP4_CD14negCD16pos', 'NHP10_TP1_CD14negCD16pos', 'NHP10_TP2_CD14negCD16pos', 'NHP10_TP3_CD14negCD16pos', 'NHP10_TP4_CD14negCD16pos', 'NHP27_TP1_CD14negCD16pos', 'NHP27_TP2_CD14negCD16pos', 'NHP27_TP3_CD14negCD16pos', 'NHP27_TP4_CD14negCD16pos', 'NHP21_TP1_CD14negCD16pos', 'NHP21_TP1_CD14negCD16pos']

Finally, I have tried removing extra lines and carriage returns, and that hasn't fixed it either. I don't understand where the issue is coming from.

Attaching samples file as example.
samples.txt

dryrun error

TypeError in line 39 of /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/ASAP/v0.5.3/workflow/rules/init.smk:
stat: path should be string, bytes, os.PathLike or integer, not Series

Using ASAP found here

/data/CCBR_Pipeliner/Pipelines/ASAP/v0.5.3

Need an extra pair of eyes, @kopardev!

Check replicate names are unique

Add code to check samples.tsv to ensure that replicate names are unique. If names are not unique, an error will occur at init.smk during the FQ check (see issue #8).
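
A minimal sketch of such a check follows, using only the standard library. It is not the pipeline's code: `duplicated_replicates` is a hypothetical helper, the manifest rows are made up, and it assumes samples.tsv is tab-separated with a header row containing a `replicateName` column (the name seen in the printed output from issue #8).

```python
import csv
import io
from collections import Counter


def duplicated_replicates(handle, column="replicateName"):
    """Return replicate names that appear more than once, in first-seen order."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(row[column] for row in reader)
    return [name for name, n in counts.items() if n > 1]


# Made-up manifest; the real samples.tsv has more columns.
manifest = io.StringIO(
    "replicateName\tpath_to_R1_fastq\n"
    "repA\t/path/a_r1.fq.gz\n"
    "repB\t/path/b_r1.fq.gz\n"
    "repA\t/path/c_r1.fq.gz\n"
)
dups = duplicated_replicates(manifest)
if dups:
    print("Duplicate replicate names: " + ", ".join(dups))
```

Running a check like this before the FQ existence test would let the pipeline stop with a message that names the offending replicates, instead of failing later with a TypeError.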

tmp

Summary

  • Project failure at three rules for all samples.
  • Choose one sample to review issue: DGE0_CD3_Wk4_BCG.
  • Rules failing:
    • Error in rule atac_fld:
    • Error in rule atac_genrich_peakcalling:
    • Error in rule atac_macs_peakcalling:

Project location

/data/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20

Rule code for atac_fld

cat /data/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/logs/54331095.54336662.atac_fld.replicate=DGE0_CD3_Wk4_BCG.err

python /gpfs/gsfs10/users/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/scripts/ccbr_atac_bam2FLD.py -i /gpfs/gsfs10/users/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/results/dedupBam/DGE0_CD3_Wk4_BCG.dedup.bam -o /gpfs/gsfs10/users/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/results/QC/fld/DGE0_CD3_Wk4_BCG.fld.txt

Review input; no reads found

samtools view /gpfs/gsfs10/users/NCI_VB/franchini/ccrvb19_Max_covid_ATACseq_KCG/ASAP_221215_subset20/results/dedupBam/DGP4_CD3_Wk4_BCG.dedup.bam

Review QC for bam file

cat DGE0_CD3_Wk4_BCG.bowtie2.bam.flagstat
67647231 + 0 in total (QC-passed reads + QC-failed reads)
16561337 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
66637105 + 0 mapped (98.51% : N/A)
51085894 + 0 paired in sequencing
25542947 + 0 read1
25542947 + 0 read2
48207136 + 0 properly paired (94.36% : N/A)
49686192 + 0 with itself and mate mapped
389576 + 0 singletons (0.76% : N/A)
92274 + 0 with mate mapped to a different chr
38075 + 0 with mate mapped to a different chr (mapQ>=5)

read access not given for v.0.6.1

Currently not all files within this path have read access, and therefore the pipeline cannot run:

/data/CCBR_Pipeliner/Pipelines/ASAP/v0.6.1

@kopardev - please update permissions. I've created a tmp dir in the meantime which has read access (and write access) that can be deleted once this is changed.

Thanks!

jaccard time out; max out on CPUS

Running jaccard on a recent mmul dataset has led to samples timing out. The current resource allocation should be adjusted.
Current resources

        "jaccard": {
                "mem": "48g",
                "threads": "2",
                "time": "01-00:00:00"
        },

Proposed changes

        "jaccard": {
                "mem": "40g",
                "threads": "32",
                "time": "02-00:00:00"
        },

Memory and time can probably be dropped with the increase in threads; will follow up after re-running current samples through with these changes.
