epigen / enrichment_analysis Goto Github PK

A Snakemake workflow for performing genomic region set and gene set enrichment analyses using LOLA, GREAT, GSEApy, pycisTarget and RcisTarget.

Home Page: https://epigen.github.io/enrichment_analysis/

License: MIT License

Python 62.37% R 37.63%

bioinformatics genomic-regions enrichment-analysis atac-seq biomedical-data-science chip-seq gene-set-enrichment gene-sets rna-seq visualization

enrichment_analysis's People

Contributors

Stargazers

Watchers

enrichment_analysis's Issues

Error in rule gene_ORA_GSEApy: jobid: 31, raise ValueError("No objects to concatenate")

Hi!

Do you have an idea why your pipeline fails always on this step? :)
I've tried 2 different inputs and there is always problem on that step 32 of 81.

Below is the part of the console output with the error message:

`Warning message:
package ‘svglite’ was built under R version 4.1.3 
[Fri May 26 18:19:55 2023]
Finished job 22.
32 of 81 steps (40%) done
Select jobs to execute...

[Fri May 26 18:19:55 2023]
rule gene_ORA_GSEApy:
    input: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/UP_sorted_cyto_from_10kb_bins.bed/GREAT/genes.txt, /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/background/GREAT/genes.txt, resources/Sorted_cyto_from_bins_2_run/MyDB.json
    output: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/UP_sorted_cyto_from_10kb_bins.bed/ORA_GSEApy/MyDB/UP_sorted_cyto_from_10kb_bins.bed_MyDB.csv
    log: logs/rules/gene_ORA_GSEApy_UP_sorted_cyto_from_10kb_bins.bed_MyDB.log
    jobid: 31
    reason: Missing output files: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/UP_sorted_cyto_from_10kb_bins.bed/ORA_GSEApy/MyDB/UP_sorted_cyto_from_10kb_bins.bed_MyDB.csv; Input files updated by another job: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/UP_sorted_cyto_from_10kb_bins.bed/GREAT/genes.txt, /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/background/GREAT/genes.txt
    wildcards: gene_set=UP_sorted_cyto_from_10kb_bins.bed, db=MyDB
    resources: tmpdir=/tmp, mem_mb=32000

python -c "import sys; print('.'.join(map(str, sys.version_info[:2])))"
Activating conda environment: .snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_
python /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/scripts/tmp49w6b_kb.gene_ORA_GSEApy.py
Activating conda environment: .snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_
2023-05-26 18:19:57,009 [INFO] Input dict object named with gs_ind_0
2023-05-26 18:19:57,009 [INFO] Run: gs_ind_0 
2023-05-26 18:19:57,010 [INFO] No hits return, for gene set: Custom140555313936960
Traceback (most recent call last):
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/scripts/tmp49w6b_kb.gene_ORA_GSEApy.py", line 91, in <module>
    res = gp.enrich(gene_list=gene_list,
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/gseapy/__init__.py", line 607, in enrich
    enr.run()
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/gseapy/enrichr.py", line 534, in run
    self.results = pd.concat(self.results, ignore_index=True)
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 274, in concat
    op = _Concatenator(
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 331, in __init__
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate
[Fri May 26 18:19:57 2023]
Error in rule gene_ORA_GSEApy:
    jobid: 31
    output: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/UP_sorted_cyto_from_10kb_bins.bed/ORA_GSEApy/MyDB/UP_sorted_cyto_from_10kb_bins.bed_MyDB.csv
    log: logs/rules/gene_ORA_GSEApy_UP_sorted_cyto_from_10kb_bins.bed_MyDB.log (check log file(s) for error message)
    conda-env: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_

RuleException:
CalledProcessErrorin line 65 of /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/workflow/rules/enrichment_analysis.smk:
Command 'source /mnt/polkanowa2/programs/bin/activate '/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_'; set -eo pipefail; python /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/scripts/tmp49w6b_kb.gene_ORA_GSEApy.py' returned non-zero exit status 1.
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/workflow/rules/enrichment_analysis.smk", line 65, in __rule_gene_ORA_GSEApy
  File "/mnt/polkanowa2/programs/lib/python3.9/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-05-26T181430.372545.snakemake.log
(base) root@SRV602-88C9:/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis# `

Here is attached the log file.
2023-05-26T181430.372545.snakemake.log

Please let me know if you need anything else!
Thanks
Sandra

release v1.0.0

note for docs

GREAT
- rGREAT is not using the provided cores although correct parameter -> could not make it work, still provide it as parameter
- region-gene association wihtout background: faster and more genes and will be anyway accounted for in the gene-based analyses
- research & document GREAT statistics: https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655456/Statistics
pycisTarget:
- bakcground regions not used
- point to github page with instruction for custom cisTarget database creation
RcisTarget: uses background

investigate preranked GSEApy behaviour for +/-Inf scores

DEA modules ( and ) calculate feature scores. In the case of raw p-values zero, the -log10 is Inf, hence feature scores are +/- Inf.

Test the behavior and handle the exception in case of undesired behaviors
document it either way.
replace -/+inf with min/max in data

include decoupleR

https://decoupler-py.readthedocs.io/en/latest/
python implementation (faster) for enrichment analysis (nothing else?)
leverage OmniPath: https://omnipathdb.org/

User provided databases

support the functionality to provide your own databases as JSON (= dictionaries) files

The results are not generating

Hi!

It was a great idea to create such tool!
I have an issue and I don't know where might be a root cause that the results are not generating.

When running:
$ snakemake -p --conda-frontend conda --configfile config/config.yaml -c1
I receive:

Config file config/config.yaml is extended by additional config specified via the command line.
Building DAG of [jobs...]
Nothing to be done (all requested files are present and up to date).
Complete log: .snakemake/log/2023-04-06T075124.125638.snakemake.log

In the complete log is the same information which is displayed on the screen above.

I attach the enrichment_analysis_annotation.csv file. Here is a config.yaml content:

# alwayse use absolute paths

##### RESOURCES #####
partition: 'tinyq'
mem: '32000'
threads: 1


##### GENERAL #####
annotation: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/enrichment_analysis_annotation.csv
result_path: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/
project_name: Sorted_cyto_from_10kb_bins

# genome
# human 'hg19' or 'hg38' 
# mouse 'mm9' or 'mm10'
genome: 'hg38'


##### TOOLS #####

### GSEApy - ORA Enrichr (Fisher/hypergeometric test) and preranked GSEA based analysis

# Databases downloaded from Enrichr (https://maayanlab.cloud/Enrichr/#libraries)
# example: enrichr_dbs: ["KEGG_2021_Mouse", "GO_Biological_Process_2021", "WikiPathways_2019_Mouse"]
enrichr_dbs: ["KEGG_2021_Human", "GO_Biological_Process_2021", "WikiPathways_2019_Human"]

# Databases in GMT format containing Gene Symbols e.g, downloaded from MSigDB (http://www.gsea-msigdb.org/gsea/msigdb)
local_gmt_dbs:
    MyMSigDB: "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/databases/msigdb.v2023.1.Hs.symbols.gmt"

# path to local databases as JSON files will be loaded as dictionaries
# example content: { "MyDB_Term1": ["geneA","geneB","geneC"],"MyDB_Term2": ["geneX","geneY","geneZ"]}
local_json_dbs:
    MyDB: "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/databases/c2.cp.wikipathways.v2023.1.Hs.json"

### GREAT - region-gene association based analysis

# databases to be queried from GREAT (https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655440/Ontologies)
# not all ontologies are available for all genomes and GREAT versions (here we use version 4)
great_dbs: ['GO Molecular Function','GO Biological Process','GO Cellular Component','Mouse Phenotype','Mouse Phenotype Single KO','Human Phenotype']

### LOLA - region overlap based analysis

# databases to be queried by LOLA (https://databio.org/regiondb)
# not all databases are available for all genomes (eg mm10 only supports LOLACore)
lola_dbs: ['LOLACore','jaspar_motifs','roadmap_epigenomics']

### Enrichment plot

# tool specific column names for aggregation, plotting & summaries
column_names:
    ORA_GSEApy:
        top_n: 25
        p_value: 'P_value'
        adj_pvalue: 'Adjusted_P_value'
        effect_size: 'Odds_Ratio'
        overlap: 'Overlap'
        term: 'Term'
    preranked_GSEApy:
        top_n: 25
        p_value: 'NOM_p_val'
        adj_pvalue: 'FDR_q_val'
        effect_size: 'NES'
        overlap: 'Tag'
        term: 'Term'
    GREAT:
        top_n: 25
        p_value: "HyperP"
        adj_pvalue: "HyperFdrQ"
        effect_size: "RegionFoldEnrich"
        overlap: "TermCov"
        term: "Desc"
    LOLA:
        top_n: 25
        p_value: "pValue"
        adj_pvalue: "qValue"
        effect_size: "oddsRatio"
        overlap: "support"
        term: "description"


# GREAT before
#     GREAT:
#         top_n: 25
#         p_value: "Hyper_Raw_PValue"
#         adj_pvalue: "Hyper_Adjp_BH"
#         effect_size: "Hyper_Fold_Enrichment"
#         overlap: "Hyper_Region_Set_Coverage"
#         term: "name"

##### AGGREGATE & SUMMARIZE #####

# adjusted p-value threshold per tool to denote statistical significance
adjp_th:
    ORA_GSEApy: 0.05
    preranked_GSEApy: 0.05
    GREAT: 0.01
    LOLA: 0.01

# number of top terms per feature set within each group for all overview plots (adjusted p-value, effect-size and bubble-heatmap)
top_terms_n: 5

# cap for adjusted p-value plotting: -log10(adjusted p-value) > adjp_cap -> adjp_cap
adjp_cap: 4

# cap for odds ratio plotting: abs(log2(odds ratio)) > or_cap -> sign(log2(odds ratio)) * or_cap
or_cap: 5

# cap for  normalized enrichemnt scores (NES) abs(nes) > nes_cap -> sign(nes) * nes_cap
# applicable only to preranked_GSEApy
nes_cap: 5

If you need anything else please let me know!
Thanks for your help!

enrichment_analysis_annotation.csv

user-defined database support for LOLA

Make users provide local database paths (no automatic download). Also supports then custom database usage.

include g:profiler from Elixier

https://biit.cs.ut.ee/gprofiler/page/r

Aggregate/visualize not working (because of empty output?)

These jobs remain:

Potentially because of empty combined files(?):

The individual outputs are not empty:

Maybe because nothing is significant?

Enrichment analysis result visualisation using volcano plots (significance by effect-size)

fix group summary dot plot

always one NA col and row
squished plot -> make wider and font far smaller

enhanced visualization considering hierarchical nature of enrichment terms

use this package for simplifying results AND word clouds next to heatmaps!
- http://www.bioconductor.org/packages/release/bioc/html/simplifyEnrichment.html
- uses complexHeatmap
traverse up the GO hierarchy and find most common parent node as systematic high level annotation of multiple GO terms
implement pathway-based visualization
- “old” for mapping results on a (KEGG) pathway: pathview https://pathview.r-forge.r-project.org/
- "new” Systems Biology Graphical Notation includes many more pathway sources, plus vector graphics. SBGNview https://www.bioconductor.org/packages/release/bioc/html/SBGNview.html

Software
Methods
Features

GREAT issues

GREAT issues reported by RtH, I could not replicate them
removing "download_by = tsv" parameter here

enrichment_analysis/workflow/scripts/region_enrichment_analysis_GREAT.R

Line 68 in d5ffc84

tb <- getEnrichmentTables(job, ontology = db, category=NULL, download_by = 'tsv')
jobs spaced out with more delays that might/could help with random "denial of service attack”-prevention? (dozens/100s of simultaneous jobs seem to trigger failures) → check if I can implement some kind of staggered/sleep approach within Snakemake
The GREAT server currently accepts a maximum of 5 API requests for all API users at any given time. More than 5 requests will be denied. To make the maximum of 5 API calls you will need to split your jobs and process them separately. A possibly way of processing the data is seen in our [greatBatchQuery.py](https://great-help.atlassian.net/wiki/download/attachments/655447/greatBatchQuery.py?version=1&modificationDate=1345890224000&cacheVersion=1&api=v2) script. Feel free to adapt it to your needs. from https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655447/Programming+Interface
the most probable cause seems really the many parallel queries. queries are larger when the parameter is =tsv hence it may fail sooner
RtH: yeah, maybe. But I also reduced the jobs to just 5 and it still had the error
RtH: Maybe there is some 'ddos' type security for high volume, high payload jobs, but that also doesn't make sense

update and document all packages to latest version

use the latest version of all packages, re-test & provide versions in environment .yaml files

TFBS/motif enrichment analysis

using Rcistarget (code in macroStim project) or SCENIC packages
input is the same: regions & genes
First for gene lists then region lists (=more complicated?)
https://scenicplus.readthedocs.io/en/latest/
pycisTarget seems promising, but only for regions? maybe mapping from genes to regions is required & easy
pycistarget (Rcistarget) based
with summarization to maxNES (and/or enrGene Numbers? -> config?)

PPI analysis using STRING R package

code in macroStim project
apply to all provided gene sets (e.g., up regulated)
or go beyond and implement "network-based" enrichment analysis using e.g., OMNIPATH testing if genes are closer in the network than random chance / background gene set of expressed genes

raise KeyError(key) from err KeyError: 'qValue' Error in rule aggregate:

Hi!

That's me again 😄
Sorry to bother you but I'm testing the latest version of your pipeline on many different input files and for the different ones I get new error message:

python -c "import sys; print('.'.join(map(str, sys.version_info[:2])))"
Activating conda environment: .snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_
python /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/scripts/tmpn224jb2o.aggregate.py
Activating conda environment: .snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_
Traceback (most recent call last):
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'qValue'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/scripts/tmpn224jb2o.aggregate.py", line 55, in <module>
    sig_terms = result_df.loc[result_df[adjp_col]<adjp_th, term_col].unique()
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: 'qValue'
[Mon Jun  5 22:36:57 2023]
Error in rule aggregate:
    jobid: 73
    output: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/mysterySets/LOLA/LOLACore/mysterySets_LOLACore_all.csv, /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/mysterySets/LOLA/LOLACore/mysterySets_LOLACore_sig.csv
    log: logs/rules/aggregate_mysterySets_LOLA_LOLACore.log (check log file(s) for error message)
    conda-env: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_

RuleException:
CalledProcessErrorin line 19 of /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/workflow/rules/aggregate.smk:
Command 'source /mnt/polkanowa2/programs/bin/activate '/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_'; set -eo pipefail; python /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/scripts/tmpn224jb2o.aggregate.py' returned non-zero exit status 1.
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/workflow/rules/aggregate.smk", line 19, in __rule_aggregate
  File "/mnt/polkanowa2/programs/lib/python3.9/concurrent/futures/thread.py", line 58, in run
Removing output files of failed job aggregate since they might be corrupted:
/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/mysterySets/LOLA/LOLACore/mysterySets_LOLACore_all.csv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-06-05T222043.715874.snakemake.log

If you need anything else for your testing purposes, please let me know.

Thanks for your help as always! :)
Sandra

include clusterProfiler ecosystem

https://yulab-smu.top/biomedical-knowledge-mining-book/
for ORA/GSEA: clusterProfiler
to visualize the results: enrichplot
The clusterProfiler 4.0 paper, Wu et al., 2021, to get a first overview: https://www.sciencedirect.com/science/article/pii/S2666675821000667
an excellent book written by Guangchuang Yu, the developer. Explains in detail how to run the packages and visualize the results. https://yulab-smu.top/biomedical-knowledge-mining-book/

epigen / enrichment_analysis Goto Github PK

enrichment_analysis's People

Contributors

Stargazers

Watchers

enrichment_analysis's Issues

Recommend Projects

Recommend Topics

Recommend Org