Coder Social home page Coder Social logo

epigen / enrichment_analysis Goto Github PK

View Code? Open in Web Editor NEW
18.0 2.0 0.0 1.79 MB

A Snakemake workflow for performing genomic region set and gene set enrichment analyses using LOLA, GREAT, and GSEApy.

Home Page: https://epigen.github.io/enrichment_analysis/

License: MIT License

Python 61.26% R 38.74%
bioinformatics genomic-regions enrichment-analysis atac-seq biomedical-data-science chip-seq gene-set-enrichment gene-sets rna-seq visualization

enrichment_analysis's People

Contributors

sreichl avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

enrichment_analysis's Issues

raise KeyError(key) from err KeyError: 'qValue' Error in rule aggregate:

Hi!

That's me again 😄
Sorry to bother you but I'm testing the latest version of your pipeline on many different input files and for the different ones I get new error message:

python -c "import sys; print('.'.join(map(str, sys.version_info[:2])))"
Activating conda environment: .snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_
python /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/scripts/tmpn224jb2o.aggregate.py
Activating conda environment: .snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_
Traceback (most recent call last):
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'qValue'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/scripts/tmpn224jb2o.aggregate.py", line 55, in <module>
    sig_terms = result_df.loc[result_df[adjp_col]<adjp_th, term_col].unique()
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: 'qValue'
[Mon Jun  5 22:36:57 2023]
Error in rule aggregate:
    jobid: 73
    output: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/mysterySets/LOLA/LOLACore/mysterySets_LOLACore_all.csv, /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/mysterySets/LOLA/LOLACore/mysterySets_LOLACore_sig.csv
    log: logs/rules/aggregate_mysterySets_LOLA_LOLACore.log (check log file(s) for error message)
    conda-env: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_

RuleException:
CalledProcessErrorin line 19 of /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/workflow/rules/aggregate.smk:
Command 'source /mnt/polkanowa2/programs/bin/activate '/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_'; set -eo pipefail; python /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/scripts/tmpn224jb2o.aggregate.py' returned non-zero exit status 1.
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/workflow/rules/aggregate.smk", line 19, in __rule_aggregate
  File "/mnt/polkanowa2/programs/lib/python3.9/concurrent/futures/thread.py", line 58, in run
Removing output files of failed job aggregate since they might be corrupted:
/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/mysterySets/LOLA/LOLACore/mysterySets_LOLACore_all.csv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-06-05T222043.715874.snakemake.log

If you need anything else for your testing purposes, please let me know.

Thanks for your help as always! :)
Sandra

enhanced visualization considering hierarchical nature of enrichment terms

investigate preranked GSEApy behaviour for +/-Inf scores

DEA modules ( and ) calculate feature scores. In the case of raw p-values zero, the -log10 is Inf, hence feature scores are +/- Inf.

  • Test the behavior and handle the exception in case of undesired behaviors
  • document it either way.

Error in rule gene_ORA_GSEApy: jobid: 31, raise ValueError("No objects to concatenate")

Hi!

Do you have an idea why your pipeline fails always on this step? :)
I've tried 2 different inputs and there is always problem on that step 32 of 81.

Below is the part of the console output with the error message:

`Warning message:
package ‘svglite’ was built under R version 4.1.3 
[Fri May 26 18:19:55 2023]
Finished job 22.
32 of 81 steps (40%) done
Select jobs to execute...

[Fri May 26 18:19:55 2023]
rule gene_ORA_GSEApy:
    input: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/UP_sorted_cyto_from_10kb_bins.bed/GREAT/genes.txt, /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/background/GREAT/genes.txt, resources/Sorted_cyto_from_bins_2_run/MyDB.json
    output: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/UP_sorted_cyto_from_10kb_bins.bed/ORA_GSEApy/MyDB/UP_sorted_cyto_from_10kb_bins.bed_MyDB.csv
    log: logs/rules/gene_ORA_GSEApy_UP_sorted_cyto_from_10kb_bins.bed_MyDB.log
    jobid: 31
    reason: Missing output files: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/UP_sorted_cyto_from_10kb_bins.bed/ORA_GSEApy/MyDB/UP_sorted_cyto_from_10kb_bins.bed_MyDB.csv; Input files updated by another job: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/UP_sorted_cyto_from_10kb_bins.bed/GREAT/genes.txt, /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/background/GREAT/genes.txt
    wildcards: gene_set=UP_sorted_cyto_from_10kb_bins.bed, db=MyDB
    resources: tmpdir=/tmp, mem_mb=32000

python -c "import sys; print('.'.join(map(str, sys.version_info[:2])))"
Activating conda environment: .snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_
python /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/scripts/tmp49w6b_kb.gene_ORA_GSEApy.py
Activating conda environment: .snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_
2023-05-26 18:19:57,009 [INFO] Input dict object named with gs_ind_0
2023-05-26 18:19:57,009 [INFO] Run: gs_ind_0 
2023-05-26 18:19:57,010 [INFO] No hits return, for gene set: Custom140555313936960
Traceback (most recent call last):
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/scripts/tmp49w6b_kb.gene_ORA_GSEApy.py", line 91, in <module>
    res = gp.enrich(gene_list=gene_list,
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/gseapy/__init__.py", line 607, in enrich
    enr.run()
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/gseapy/enrichr.py", line 534, in run
    self.results = pd.concat(self.results, ignore_index=True)
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 274, in concat
    op = _Concatenator(
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 331, in __init__
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate
[Fri May 26 18:19:57 2023]
Error in rule gene_ORA_GSEApy:
    jobid: 31
    output: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/enrichment_analysis/UP_sorted_cyto_from_10kb_bins.bed/ORA_GSEApy/MyDB/UP_sorted_cyto_from_10kb_bins.bed_MyDB.csv
    log: logs/rules/gene_ORA_GSEApy_UP_sorted_cyto_from_10kb_bins.bed_MyDB.log (check log file(s) for error message)
    conda-env: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_

RuleException:
CalledProcessErrorin line 65 of /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/workflow/rules/enrichment_analysis.smk:
Command 'source /mnt/polkanowa2/programs/bin/activate '/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/conda/db2069d06c67e34fe0d5f2324aefa1c9_'; set -eo pipefail; python /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/.snakemake/scripts/tmp49w6b_kb.gene_ORA_GSEApy.py' returned non-zero exit status 1.
  File "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/workflow/rules/enrichment_analysis.smk", line 65, in __rule_gene_ORA_GSEApy
  File "/mnt/polkanowa2/programs/lib/python3.9/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-05-26T181430.372545.snakemake.log
(base) root@SRV602-88C9:/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis# `

Here is attached the log file.
2023-05-26T181430.372545.snakemake.log

Please let me know if you need anything else!
Thanks
Sandra

TFBS/motif enrichment analysis

  • using Rcistarget (code in macroStim project) or SCENIC packages
  • input is the same: regions & genes
  • First for gene lists then region lists (=more complicated?)
  • https://scenicplus.readthedocs.io/en/latest/
  • pycisTarget seems promising, but only for regions? maybe mapping from genes to regions is required & easy
  • pycistarget (Rcistarget) based
  • with summarization to maxNES (and/or enrGene Numbers? -> config?)

update and document all packages to latest version

use the latest version of all packages, re-test & provide versions in environment .yaml files

PPI analysis using STRING R package

  • code in macroStim project
  • apply to all provided gene sets (e.g., up regulated)
  • or go beyond and implement "network-based" enrichment analysis using e.g., OMNIPATH testing if genes are closer in the network than random chance / background gene set of expressed genes

The results are not generating

Hi!

It was a great idea to create such tool!
I have an issue and I don't know where might be a root cause that the results are not generating.

When running:
$ snakemake -p --conda-frontend conda --configfile config/config.yaml -c1
I receive:

Config file config/config.yaml is extended by additional config specified via the command line.
Building DAG of [jobs...]
Nothing to be done (all requested files are present and up to date).
Complete log: .snakemake/log/2023-04-06T075124.125638.snakemake.log

In the complete log is the same information which is displayed on the screen above.

I attach the enrichment_analysis_annotation.csv file. Here is a config.yaml content:

# alwayse use absolute paths

##### RESOURCES #####
partition: 'tinyq'
mem: '32000'
threads: 1


##### GENERAL #####
annotation: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/enrichment_analysis_annotation.csv
result_path: /mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/results/
project_name: Sorted_cyto_from_10kb_bins

# genome
# human 'hg19' or 'hg38' 
# mouse 'mm9' or 'mm10'
genome: 'hg38'


##### TOOLS #####

### GSEApy - ORA Enrichr (Fisher/hypergeometric test) and preranked GSEA based analysis

# Databases downloaded from Enrichr (https://maayanlab.cloud/Enrichr/#libraries)
# example: enrichr_dbs: ["KEGG_2021_Mouse", "GO_Biological_Process_2021", "WikiPathways_2019_Mouse"]
enrichr_dbs: ["KEGG_2021_Human", "GO_Biological_Process_2021", "WikiPathways_2019_Human"]

# Databases in GMT format containing Gene Symbols e.g, downloaded from MSigDB (http://www.gsea-msigdb.org/gsea/msigdb)
local_gmt_dbs:
    MyMSigDB: "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/databases/msigdb.v2023.1.Hs.symbols.gmt"

# path to local databases as JSON files will be loaded as dictionaries
# example content: { "MyDB_Term1": ["geneA","geneB","geneC"],"MyDB_Term2": ["geneX","geneY","geneZ"]}
local_json_dbs:
    MyDB: "/mnt/polkanowa2/Cytometh_Bartosz/enrichment_analysis/enrichment_analysis/databases/c2.cp.wikipathways.v2023.1.Hs.json"

### GREAT - region-gene association based analysis

# databases to be queried from GREAT (https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655440/Ontologies)
# not all ontologies are available for all genomes and GREAT versions (here we use version 4)
great_dbs: ['GO Molecular Function','GO Biological Process','GO Cellular Component','Mouse Phenotype','Mouse Phenotype Single KO','Human Phenotype']

### LOLA - region overlap based analysis

# databases to be queried by LOLA (https://databio.org/regiondb)
# not all databases are available for all genomes (eg mm10 only supports LOLACore)
lola_dbs: ['LOLACore','jaspar_motifs','roadmap_epigenomics']

### Enrichment plot

# tool specific column names for aggregation, plotting & summaries
column_names:
    ORA_GSEApy:
        top_n: 25
        p_value: 'P_value'
        adj_pvalue: 'Adjusted_P_value'
        effect_size: 'Odds_Ratio'
        overlap: 'Overlap'
        term: 'Term'
    preranked_GSEApy:
        top_n: 25
        p_value: 'NOM_p_val'
        adj_pvalue: 'FDR_q_val'
        effect_size: 'NES'
        overlap: 'Tag'
        term: 'Term'
    GREAT:
        top_n: 25
        p_value: "HyperP"
        adj_pvalue: "HyperFdrQ"
        effect_size: "RegionFoldEnrich"
        overlap: "TermCov"
        term: "Desc"
    LOLA:
        top_n: 25
        p_value: "pValue"
        adj_pvalue: "qValue"
        effect_size: "oddsRatio"
        overlap: "support"
        term: "description"


# GREAT before
#     GREAT:
#         top_n: 25
#         p_value: "Hyper_Raw_PValue"
#         adj_pvalue: "Hyper_Adjp_BH"
#         effect_size: "Hyper_Fold_Enrichment"
#         overlap: "Hyper_Region_Set_Coverage"
#         term: "name"

##### AGGREGATE & SUMMARIZE #####

# adjusted p-value threshold per tool to denote statistical significance
adjp_th:
    ORA_GSEApy: 0.05
    preranked_GSEApy: 0.05
    GREAT: 0.01
    LOLA: 0.01

# number of top terms per feature set within each group for all overview plots (adjusted p-value, effect-size and bubble-heatmap)
top_terms_n: 5

# cap for adjusted p-value plotting: -log10(adjusted p-value) > adjp_cap -> adjp_cap
adjp_cap: 4

# cap for odds ratio plotting: abs(log2(odds ratio)) > or_cap -> sign(log2(odds ratio)) * or_cap
or_cap: 5

# cap for  normalized enrichemnt scores (NES) abs(nes) > nes_cap -> sign(nes) * nes_cap
# applicable only to preranked_GSEApy
nes_cap: 5

If you need anything else please let me know!
Thanks for your help!

enrichment_analysis_annotation.csv

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.