openomics / chrom-seek
An awesome set of epigenetic pipelines for bulk cfChip-seq, ChIP-seq, and ATAC-seq
Home Page: https://openomics.github.io/chrom-seek/
License: MIT License
It would be awesome if we had a script to automate the process of building reference files for different/new organisms, something where we could give it a minimal set of inputs (such as a genomic fasta file, annotation, and an optional blacklist, etc.) and it would handle the rest.
At the current moment, the pipeline is using environment modules. We need to build docker images for all the software dependencies of the pipeline.
The new features to be added to the cfChIP assay include:
This step completes the basic functional pipeline for cfChIP.
SICER2 has information pertaining to genome chromosome lengths hard-coded in the following file:
https://github.com/zanglab/SICER2/blob/master/sicer/lib/GenomeData.py
To add support for a new reference genome, a user must do the following:
I emailed the author of SICER2 asking if it would be possible to implement any of the suggestions here. It is also worth noting that, if one were to make these changes, one should also update the dtype set for chr. At the current moment, it is set to U6, or a unicode string of length 6, which may not be long enough in some edge cases (a string could get truncated and cause collisions).
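To illustrate the truncation risk described above, here is a minimal sketch in plain Python. It emulates a fixed-width six-character field by slicing each name to six characters, which mirrors what NumPy does when longer strings are cast to a U6 dtype. The contig names are illustrative examples of long names and are assumptions for this example, not values taken from SICER2.

```python
from collections import defaultdict

FIELD_WIDTH = 6  # width of a "U6" field: at most six unicode characters


def truncate_names(names, width=FIELD_WIDTH):
    """Emulate storing names in a fixed-width string field by slicing."""
    return [name[:width] for name in names]


def find_collisions(names, width=FIELD_WIDTH):
    """Group original names by truncated form; keep groups with >1 member."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


# Illustrative chromosome/contig names (assumed for this example).
chroms = ["chr1", "chr10", "chrX", "chrUn_NW_021160383v1", "chrUn_NW_021160384v1"]

print(truncate_names(chroms))
print(find_collisions(chroms))
```

Short names such as chr1 or chrX pass through unchanged, while the two long unplaced-contig names collapse to the same six-character key, which is the collision scenario the issue warns about.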
For the time being, we will turn off sicer2 when the reference genome is rheMac10; however, we should explore other solutions if the author cannot implement this on their side:
Update sicer/lib/GenomeData.py with rheMac10 information and build docker image
Currently using this code as a fix for ppqt rule to run the chrom-seek pipeline, but need to find a more straightforward solution.
if paired_end:
    extensionsPPQT = {"sorted": "bam", "Q5DD": "bam"}
else:
    extensionsPPQT = {"sorted": "bam", "tagAlign": "gz", "Q5DD": "tagAlign.gz"}
The following issue was discovered while running run_spp.R from ppqt:
The error indicates that the R package, caTools, was missing from the docker image. Initially, I could not reproduce this behavior, and from my side, it appeared that caTools was installed:
$ module purge
$ ml singularity
$ singularity exec -B $PWD /data/OpenOmics/SIFs/ppqt_v0.2.0.sif Rscript -e 'library(caTools); sessionInfo()'
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] caTools_1.18.2
loaded via a namespace (and not attached):
[1] compiler_4.1.2 bitops_1.0-7
After more debugging, it appeared that I could reproduce the error if I module load R before running singularity:
$ module purge
$ module load R
$ ml singularity
$ singularity exec -B $PWD /data/OpenOmics/SIFs/ppqt_v0.2.0.sif Rscript -e 'library(caTools); sessionInfo()'
Error in library(caTools) : there is no package called ‘caTools’
Execution halted
It appears the following environment variable, R_LIBS_SITE, is exported when R is module loaded into a user's $PATH. If that environment variable is set, it changes the .libPaths that R uses internally to find R packages. This can cause issues with how the R installation within the docker image finds/loads packages.
The following can be used to fix the problem.
$ module purge
$ module load R
$ ml singularity
$ R_LIBS_SITE='' singularity exec -B $PWD /data/OpenOmics/SIFs/ppqt_v0.2.0.sif Rscript -e 'library(caTools); sessionInfo()'
...
However, it would be better to run singularity with the --containall, -C option. This would prevent this problem in the future, along with other issues that can occur from sharing filesystems and env variables with the host.
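The two workarounds above (clearing R_LIBS_SITE and passing --containall) can be sketched as a small Python helper that assembles the singularity command and a sanitized environment. The function names and the extra R variables in the drop list are assumptions made for illustration; they are not part of chrom-seek.

```python
import os
import shlex

# Host variables that can redirect R's library search path inside the container.
# R_LIBS_SITE is the one named above; R_LIBS and R_LIBS_USER are included as
# an assumption, since R also consults them when building .libPaths().
R_LIBRARY_VARS = ("R_LIBS_SITE", "R_LIBS", "R_LIBS_USER")


def sanitized_env(env=None):
    """Return a copy of the environment without R library path variables."""
    env = dict(os.environ if env is None else env)
    for var in R_LIBRARY_VARS:
        env.pop(var, None)
    return env


def singularity_exec_cmd(image, command, bind=None, containall=True):
    """Build a `singularity exec` argument list.

    With containall=True, --containall is passed so the container does not
    inherit the host environment.
    """
    cmd = ["singularity", "exec"]
    if containall:
        cmd.append("--containall")
    if bind:
        cmd += ["-B", bind]
    cmd.append(image)
    cmd += list(command)
    return cmd


cmd = singularity_exec_cmd(
    "/data/OpenOmics/SIFs/ppqt_v0.2.0.sif",
    ["Rscript", "-e", "library(caTools); sessionInfo()"],
    bind=os.getcwd(),
)
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, env=sanitized_env(), check=True)
```

Either measure alone should address the caTools error described here; using both is a belt-and-braces choice for cases where --containall cannot be used everywhere.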
Hi,
I'm trying to test the pipeline using the test files, but all files in https://github.com/OpenOmics/chrom-seek/tree/main/.tests are empty except peakcall.tsv and contrasts.tsv. Could you add some working test files to the GitHub repository?
Best
Incorporate Subrata's code into the pipeline
Sicer kills the dry-run when there are samples lacking input controls.
DiffBind with blocking currently errors out. This causes both DiffBind block and regular DiffBind to fail within the pipeline. Error seems to be that the csv file being used does not contain the blocking information.
Make the following changes to how the --peakcall PEAKCALL file is parsed and validated:
For ATAC,
The following should probably be conditional when running the pipeline:
These are suggestions, not demands,
Paul
Add --rerun-trigger to run sub command
This adds a mechanism to control how snakemake decides what rules to re-run. By default, this option is set to the following: ['mtime', 'params', 'input', 'software-env', 'code'], which can cause upstream rules to re-run unexpectedly.
Add something to the docs stating that, if this occurs, the option should be set to --rerun-trigger mtime. This allows the user to decide how/why rules are re-run.
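As a rough illustration of the option described above, the sketch below wires a --rerun-trigger flag into a run sub command with argparse. The parser layout, program name, and helper function are assumptions for illustration and may differ from the actual chrom-seek command-line code; the list of trigger names is the default quoted above.

```python
import argparse

# Trigger names quoted in the issue text above.
RERUN_TRIGGERS = ["mtime", "params", "input", "software-env", "code"]


def build_parser():
    """Build a hypothetical CLI with a `run` sub command."""
    parser = argparse.ArgumentParser(prog="chrom-seek")
    subparsers = parser.add_subparsers(dest="command", required=True)
    run = subparsers.add_parser("run", help="Run the pipeline")
    run.add_argument(
        "--rerun-trigger",
        nargs="+",
        choices=RERUN_TRIGGERS,
        default=RERUN_TRIGGERS,
        help="Conditions that cause a rule to be re-run (default: all).",
    )
    return parser


# Restrict re-runs to modification-time changes only.
args = build_parser().parse_args(["run", "--rerun-trigger", "mtime"])
print(args.rerun_trigger)
```

Accepting one or more values keeps the flag flexible: a user who sees unexpected upstream re-runs can narrow it to mtime, while the default preserves the full list.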
For the ATAC part,
ppqt can fail, causing a lot of downstream issues, which can lead to bigwigs not being made. If this is not necessary for bigwig formation, it should be removed.
Paul