
GMT files for the mulea R package

Home Page: https://www.biorxiv.org/content/10.1101/2024.02.28.582444v1

Languages: Python 76.81%, Shell 19.78%, R 3.41%
Topics: gene-set-enrichment, gene-sets, ontologies

gmt_files_for_mulea's Introduction

GMT files for mulea


This repository provides ready-to-use gene sets formatted in the standardized Gene Matrix Transposed (GMT) format, compatible with the mulea R package, a comprehensive tool for overrepresentation and functional enrichment analysis.

The GMT format is a tab-delimited text file used to represent collections of genes or proteins associated with specific ontology entries. Each row in a GMT file corresponds to a single ontology element and comprises three main columns:

  1. Ontology identifier: This column uniquely identifies the element within the referenced ontology.

  2. Ontology name or description: This column provides a user-friendly label or textual description for the ontology element.

  3. List of associated genes/proteins: This column lists the gene or protein identifiers belonging to the corresponding ontology element, separated by spaces.

Within the mulea package, these entities are referred to as ontology_id, ontology_name, and list_of_values, respectively. Additionally, rows starting with a “#” symbol in the GMT file are considered comment lines and may contain supplementary information about the referenced ontology, such as its type, source, species, version, and identifier.
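The layout described above can be sketched in a few lines of Python. This is an illustration only, not the parser used by mulea: the function name parse_gmt and the sample text are hypothetical, and the code assumes that the first two columns are tab-separated while the gene identifiers may be separated by tabs or spaces.

```python
from typing import Dict, List, Tuple


def parse_gmt(text: str) -> Tuple[List[str], List[Dict[str, object]]]:
    """Split GMT text into header comments and gene-set records."""
    comments: List[str] = []
    records: List[Dict[str, object]] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            # Comment line: keep the text without the leading '#'
            comments.append(line.lstrip("#").strip())
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 tab-separated columns: {line!r}")
        # Assumption: identifiers after column 2 may be split by tabs or spaces
        values = [value for field in fields[2:] for value in field.split()]
        records.append(
            {
                "ontology_id": fields[0],
                "ontology_name": fields[1],
                "list_of_values": values,
            }
        )
    return comments, records


# Made-up sample, only to show the shape of the result
sample = "# ID_type: EntrezID\nSET1\tExample set\tG1\tG2 G3\n"
header, gene_sets = parse_gmt(sample)
print(header)
print(gene_sets)
```

The record keys follow the column names used by mulea (ontology_id, ontology_name, list_of_values), so the output is easy to compare with the data frame that mulea::read_gmt() returns.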

This repository offers pre-processed gene sets for 27 model organisms (from Escherichia coli to human) with various identifiers including UniProt, Entrez, Gene Symbol, and Ensembl IDs.

The GMT files are in the GMT_files folder, and the scripts used to create them are in the scripts_to_create_GMT_files folder. A script for mapping between different ID types is in the scripts_to_create_GMT_files/ID_mapping_scripts folder.

The GMT files can be downloaded and read with the mulea::read_gmt() function, e.g.:

mulea::read_gmt(file = "Transcription_factor_TFLink_Drosophila_melanogaster_LS_GeneSymbol.gmt")

They can also be loaded directly from this GitHub repository, e.g.:

mulea::read_gmt(file = "https://raw.githubusercontent.com/ELTEbioinformatics/GMT_files_for_mulea/main/GMT_files/Drosophila_melanogaster_7227/Transcription_factor_TFLink_Drosophila_melanogaster_LS_GeneSymbol.gmt")
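For scripts that work outside R, the same raw-file address can be assembled in Python. The sketch below is an assumption-laden illustration: the base URL is copied from the example above, the folder layout (one folder per species inside GMT_files) is inferred from that single URL, and raw_gmt_url and fetch_gmt_text are hypothetical helper names.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base address taken from the read_gmt() example in this README
RAW_BASE = (
    "https://raw.githubusercontent.com/ELTEbioinformatics/"
    "GMT_files_for_mulea/main/GMT_files"
)


def raw_gmt_url(species_folder: str, file_name: str) -> str:
    """Build the raw download URL for one GMT file (assumed folder layout)."""
    return f"{RAW_BASE}/{quote(species_folder)}/{quote(file_name)}"


def fetch_gmt_text(url: str, timeout: float = 30.0) -> str:
    """Download a GMT file and return its content as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


url = raw_gmt_url(
    "Drosophila_melanogaster_7227",
    "Transcription_factor_TFLink_Drosophila_melanogaster_LS_GeneSymbol.gmt",
)
print(url)
```

Only the URL is built here; fetch_gmt_text(url) would perform the actual download.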

We also created the muleaData ExperimentHubData Bioconductor package to make browsing and reading the ontologies easier.

List of species we cover:

  • Arabidopsis thaliana
  • Bacillus subtilis
  • Bacteroides thetaiotaomicron VPI-5482
  • Bifidobacterium longum
  • Bos taurus
  • Caenorhabditis elegans
  • Chlamydomonas reinhardtii
  • Danio rerio
  • Daphnia pulex
  • Dictyostelium discoideum
  • Drosophila melanogaster
  • Drosophila simulans
  • Escherichia coli
  • Gallus gallus
  • Homo sapiens
  • Macaca mulatta
  • Mus musculus
  • Mycobacterium tuberculosis
  • Neurospora crassa
  • Pan troglodytes
  • Rattus norvegicus
  • Saccharomyces cerevisiae
  • Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
  • Schizosaccharomyces pombe
  • Tetrahymena thermophila
  • Xenopus tropicalis
  • Zea mays

Type, name, link and citation of the databases we cover:

| Ontology category | Ontology name | Short description of content | Reference |
|---|---|---|---|
| Gene expression | FlyAtlas | Tissue-specific expression data for Drosophila melanogaster. | Chintapalli,V.R. et al. (2007) Using FlyAtlas to identify better Drosophila melanogaster models of human disease. Nat Genet, 39, 715–720. |
| Gene expression | ModEncode | Functional characterization (cell line, temporal expression, tissue expression, treatment) of elements for Caenorhabditis elegans and Drosophila melanogaster. | The Modencode Consortium et al. (2010) Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science, 330, 1787–1797. |
| Genomic location | Chromosomal Bands | Location of genes on the chromosome. | Martin,F.J. et al. (2023) Ensembl 2023. Nucleic Acids Res, 51, D933–D941. |
| Genomic location | Consecutive genes | n consecutive genes on the chromosome. | |
| miRNA regulation | miRTarBase | Experimentally validated miRNA - target interactions. | Huang,H.-Y. et al. (2022) miRTarBase update 2022: an informative resource for experimentally validated miRNA–target interactions. Nucleic Acids Res, 50, D222–D230. |
| Gene Ontology | GO | Gene Ontology (GO) categorizes genes into unified categories and attributes. | The Gene Ontology Consortium et al. (2023) The Gene Ontology knowledgebase in 2023. Genetics, 224, iyad031. |
| Pathway | Pathway Commons | Collection of biological pathway and interaction data. | Rodchenkov,I. et al. (2020) Pathway Commons 2019 Update: integration, analysis and exploration of pathway data. Nucleic Acids Res, 48, D489–D497. |
| Pathway | Reactome | Collection of biological pathway and interaction data. | Jassal,B. et al. (2020) The reactome pathway knowledgebase. Nucleic Acids Res, 48, D498–D503. |
| Pathway | Signalink | Interaction database focussing on pathways and interactions of pathways. | Csabai,L. et al. (2022) SignaLink3: a multi-layered resource to uncover tissue-specific signaling networks. Nucleic Acids Res, 50, D701–D709. |
| Pathway | Wikipathways | Collection of biological pathway and interaction data. | Martens,M. et al. (2021) WikiPathways: connecting communities. Nucleic Acids Res, 49, D613–D621. |
| Protein domain | PFAM | Protein domain structure database. | Mistry,J. et al. (2021) Pfam: The protein families database in 2021. Nucleic Acids Res, 49, D412–D419. |
| Transcription factor regulation | ATRM | Transcription factor - target gene interactions for Arabidopsis thaliana. | Jin,J. et al. (2015) An Arabidopsis transcriptional regulatory map reveals distinct functional and evolutionary features of novel transcription factors. Mol Biol Evol, 32, 1767–1773. |
| Transcription factor regulation | dorothEA | Transcription factor - target gene interactions for human and mouse. | Garcia-Alonso,L. et al. (2019) Benchmark and integration of resources for the estimation of human transcription factor activities. Genome Res, 29, 1363–1375. |
| Transcription factor regulation | RegulonDB | Transcription factor - target gene interactions for Escherichia coli bacteria. | Tierrafría,V.H. et al. (2022) RegulonDB 11.0: Comprehensive high-throughput datasets on transcriptional regulation in Escherichia coli K-12. Microb Genom, 8, 000833. |
| Transcription factor regulation | TFLink | Small- and large-scale transcription factor - target gene interactions for human and 6 model organisms. | Liska,O. et al. (2022) TFLink: an integrated gateway to access transcription factor–target gene interactions for multiple species. Database, 2022, baac083. |
| Transcription factor regulation | TRRUST | Transcription factor - target gene interactions for human. | Han,H. et al. (2018) TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res, 46, D380–D386. |
| Transcription factor regulation | Yeastract | Transcription factor - target gene interactions for Saccharomyces cerevisiae. | Teixeira,M.C. et al. (2018) YEASTRACT: an upgraded database for the analysis of transcription regulatory networks in Saccharomyces cerevisiae. Nucleic Acids Res, 46, D348–D353. |

Citation

To cite the GMT files in publications use:

Ari E, Ölbei M, Gul L, Bohár B, Stirling T (2024). muleaData: ExperimentalData Bioconductor Package for the mulea R Package, Contains Genes Sets for Functional Enrichment Analysis in GMT File Format. R package version 0.99.0, https://github.com/ELTEbioinformatics/muleaData.

gmt_files_for_mulea's People

Contributors

bbazsi41, olbeimarton


gmt_files_for_mulea's Issues

Folder structure

Please create 2 main folders:

  • GMT_files for the GMT folders and files
  • scripts_to_create_GMT_files for the scripts

GMTs with a single entry

Please delete GMT files that have fewer than 5 entries (lines), e.g.:

  • Pathways_SignaLink_Drosophila_melanogaster_EntrezID.gmt
  • Pathways_SignaLink_Drosophila_melanogaster_UniprotID.gmt
  • Pathways_SignaLink_Drosophila_melanogaster_EnsemblID.gmt
  • miRNA_regulation_miRTarBase_Gallus_gallus_EntrezID.gmt
  • GO_CC_Saccharomyces_cerevisiae_EnsemblID.gmt

Also delete files in which every entry, except for 1 or 2, contains only a single gene, e.g.:

  • Transcription_factor_TFLink_Rattus_norvegicus_SS_GeneSymbol.gmt
  • Transcription_factor_TFLink_Rattus_norvegicus_SS_EntrezID.gmt
  • Transcription_factor_TFLink_Rattus_norvegicus_SS_UniprotID.gmt
  • Transcription_factor_TFLink_Rattus_norvegicus_SS_EnsemblID.gmt
  • GO_CC_Saccharomyces_cerevisiae_entrezID.gmt

Please update the 3rd and the 4th sheet of the "mulea Supplementary Table 1" accordingly
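The two deletion criteria in this issue could be checked automatically. The following is a minimal sketch under one reading of the request: a file is flagged when it has fewer than 5 entries, or when at most 2 of its entries contain more than one gene. The function names and both thresholds are assumptions made for illustration.

```python
from typing import Tuple


def summarize_gmt(text: str) -> Tuple[int, int]:
    """Return (number of entries, number of entries with more than one gene)."""
    n_entries = 0
    n_multi_gene = 0
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        # Columns 1 and 2 are the ID and the name; the rest are gene identifiers
        genes = [gene for field in line.split("\t")[2:] for gene in field.split()]
        n_entries += 1
        if len(genes) > 1:
            n_multi_gene += 1
    return n_entries, n_multi_gene


def should_delete(text: str, min_entries: int = 5, max_multi_gene_entries: int = 2) -> bool:
    """Flag a GMT file that is too small or consists almost only of single-gene entries."""
    n_entries, n_multi_gene = summarize_gmt(text)
    return n_entries < min_entries or n_multi_gene <= max_multi_gene_entries


# Made-up content: four entries only, so the file is flagged
tiny = "\n".join(f"SET{i}\tExample set {i}\tG1\tG2\tG3" for i in range(4))
print(should_delete(tiny))
```

Flagged files would still need a manual look before deletion, because a small ontology can be legitimate for some species.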

Duplicated files

muleaData/GO_MF_Caenorhabditis_elegans_EntrezID 3.rds
muleaData/GO_MF_Caenorhabditis_elegans_EntrezID.rds
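Copies like the one above share a naming pattern: a space and a number inserted before the extension ("... 3.rds") or appended after it ("....gmt 2"). A small Python sketch can map such a name back to the presumed original. The regular expression and the function name original_name are assumptions, and the pattern only covers the two shapes seen in these issues.

```python
import re
from typing import Optional

# "<base> <number><.ext>" or "<base.ext> <number>"
DUPLICATE_SUFFIX = re.compile(r"^(?P<base>.+?) \d+(?P<ext>\.[A-Za-z0-9]+)?$")


def original_name(file_name: str) -> Optional[str]:
    """Return the presumed original file name, or None if the name is not a numbered copy."""
    match = DUPLICATE_SUFFIX.match(file_name)
    if match is None:
        return None
    return match.group("base") + (match.group("ext") or "")


for name in [
    "GO_MF_Caenorhabditis_elegans_EntrezID 3.rds",
    "GO_MF_Caenorhabditis_elegans_EntrezID.rds",
]:
    print(name, "->", original_name(name))
```

A name match is only a hint: the two files may still differ in content, so they should be compared before one of them is removed.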

NAs in the header of the GMT files of Salmonella_enterica_subsp_enterica_serovar_Typhimurium_str_LT2_99287

For example:

# Gene set GMT file for mulea Bioconductor R package
NA
NA
# ID_type: EntrezID
# source_url: www.ensembl.org 
# source_PMID: 34791404
# source_primary_ID: EnsemblID / LocusID 
# source_version: 109
# source_last_update: 2023
# gmt_download_date: 02-12-2022
# gmt_version: 1
# gmt_entry_names: chromosome location
# chromosome location chromosome location Genes

NA
NA

NA
NA

The first 2 NAs should be:

# taxon_name: Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
# taxonomy_ID: 99287

The further NAs should be deleted.
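The requested repair can be expressed as a short Python function. This is a sketch, not the script used in the repository: fix_na_header is a hypothetical name, and the code assumes that a line consisting only of "NA" is never a valid GMT row.

```python
from typing import List


def fix_na_header(lines: List[str], taxon_name: str, taxonomy_id: int) -> List[str]:
    """Replace the first two 'NA' lines with taxon headers and drop any later 'NA' lines."""
    replacements = [
        f"# taxon_name: {taxon_name}",
        f"# taxonomy_ID: {taxonomy_id}",
    ]
    fixed: List[str] = []
    for line in lines:
        if line.strip() == "NA":
            if replacements:
                fixed.append(replacements.pop(0))  # first two NAs become headers
            continue  # remaining NAs are removed
        fixed.append(line)
    return fixed


broken = [
    "# Gene set GMT file for mulea Bioconductor R package",
    "NA",
    "NA",
    "# ID_type: EntrezID",
    "NA",
    "NA",
]
for line in fix_na_header(
    broken,
    "Salmonella enterica subsp. enterica serovar Typhimurium str. LT2",
    99287,
):
    print(line)
```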

Bacteroides thetaiotaomicron taxon issues

The folder is called Bacteroides_thetaiotaomicron_VPI_5482_226186,
while the headers of the GMT files contain:

# taxon_name: Bacteroides thetaiotaomicron
# taxonomy_ID: 818

Please rewrite the headers like this:

# taxon_name: Bacteroides thetaiotaomicron VPI-5482
# taxonomy_ID: 226186
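Since the header lines follow a "# key: value" pattern, such a rewrite could be done by key. The sketch below assumes exactly that pattern; set_header_fields is a hypothetical helper, and lines without a colon or without a leading "#" are left untouched.

```python
from typing import Dict, List


def set_header_fields(lines: List[str], updates: Dict[str, str]) -> List[str]:
    """Overwrite the value of selected '# key: value' header lines."""
    result: List[str] = []
    for line in lines:
        if line.startswith("#") and ":" in line:
            key = line[1:].split(":", 1)[0].strip()
            if key in updates:
                result.append(f"# {key}: {updates[key]}")
                continue
        result.append(line)  # everything else is copied unchanged
    return result


header = [
    "# taxon_name: Bacteroides thetaiotaomicron",
    "# taxonomy_ID: 818",
    "# ID_type: EntrezID",
]
updated = set_header_fields(
    header,
    {"taxon_name": "Bacteroides thetaiotaomicron VPI-5482", "taxonomy_ID": "226186"},
)
print("\n".join(updated))
```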

Transcription_factor_ATRM_Arabidopsis_thaliana_EnsemblID.gmt 2

There is a GMT file called "Transcription_factor_ATRM_Arabidopsis_thaliana_EnsemblID.gmt 2" besides "Transcription_factor_ATRM_Arabidopsis_thaliana_EnsemblID.gmt".
The contents of the 2 files differ.
Please delete one of them and make sure the remaining file has the .gmt extension.

diff Transcription_factor_ATRM_Arabidopsis_thaliana_EnsemblID.gmt\ 2 Transcription_factor_ATRM_Arabidopsis_thaliana_EnsemblID.gmt

57c57
< ANAC012 ANAC012 AT1G16490 AT1G17950 AT1G62990 AT1G66230 AT1G73410 AT1G79180 AT4G12350 AT5G12870 AT5G16600 AT5G56110
---
> ANAC012 ANAC012 AT1G16490 AT1G17950 AT1G62990 AT1G66230 AT1G73410 AT1G79180 AT4G12350 AT4G22680 AT4G33450 AT5G12870 AT5G16600 AT5G56110
62,63c62,63
< BZIP60 BZIP60 AT1G09080 AT2G31955 AT5G28540 AT5G42020
< ABF2 ABF2 AT1G77120 AT2G18050 AT5G20830 AT5G52300 AT5G52310
---
> BZIP60 BZIP60 AT1G09080 AT2G31955 AT5G20990 AT5G28540 AT5G42020
> ABF2 ABF2 AT1G77120 AT5G20830 AT5G52300 AT5G52310
108c108
< AMS AMS AT1G13140 AT1G59740 AT1G66850 AT1G67990 AT1G73220 AT1G75790 AT1G75920 AT3G13220AT3G28740 AT3G51590 AT4G00040 AT5G17050 AT5G49070
---
> AMS AMS AT1G13140 AT1G59740 AT1G66850 AT1G67990 AT1G73220 AT1G75920 AT3G13220 AT3G28740AT3G51590 AT4G00040 AT5G17050 AT5G49070
148c148
< EMB2301 EMB2301 AT1G16490 AT1G62990 AT1G79180 AT5G12870 AT5G56110
---
> EMB2301 EMB2301 AT1G16490 AT1G62990 AT1G79180 AT4G33450 AT5G12870 AT5G56110
267c267
< DREB2A DREB2A AT1G01470 AT1G52690 AT2G41190 AT2G42540 AT3G12580 AT3G17520 AT3G50970 AT4G33720 AT5G52300 AT5G52310
---
> DREB2A DREB2A AT1G01470 AT1G52690 AT2G41190 AT2G42540 AT3G12580 AT3G17520 AT3G50970 AT5G52300 AT5G52310
273c273
< WRKY26 WRKY26 AT1G63650 AT5G60890
---
> WRKY26 WRKY26 AT1G63650 AT3G55730 AT5G60890
282c282
< MYB46 MYB46 AT1G16490 AT1G62990 AT1G79180
---
> MYB46 MYB46 AT1G16490 AT1G62990 AT1G79180 AT4G22680
296c296
< ARF7 ARF7 AT1G04240 AT1G19220 AT2G42430 AT2G45420 AT3G15540 AT3G20840 AT3G50340 AT3G58190AT4G14550 AT4G14560 AT4G37390 AT4G37650
---
> ARF7 ARF7 AT1G04240 AT1G19220 AT2G42430 AT2G45420 AT3G15540 AT3G20840 AT3G58190 AT4G14550AT4G14560 AT4G37390 AT4G37650
315c315
< ERF2 ERF2 AT1G06160 AT1G72260 AT5G44420
---
> ERF2 ERF2 AT1G06160 AT1G72260 AT3G55730 AT5G44420
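A line-based diff shows that two GMT files differ, but not directly which genes were added or removed. A set-based comparison per ontology entry, sketched below in Python, makes that explicit. The function names are hypothetical, and the code assumes tab-separated columns with gene identifiers split by tabs or spaces; the two WRKY26 rows are taken from the diff output above.

```python
from typing import Dict, List, Set


def gene_sets(text: str) -> Dict[str, Set[str]]:
    """Map each ontology ID to the set of its gene identifiers."""
    sets: Dict[str, Set[str]] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        sets[fields[0]] = {gene for field in fields[2:] for gene in field.split()}
    return sets


def compare_gmt(first_text: str, second_text: str) -> Dict[str, Dict[str, List[str]]]:
    """Report, per ontology ID, the genes present in only one of the two files."""
    first, second = gene_sets(first_text), gene_sets(second_text)
    report: Dict[str, Dict[str, List[str]]] = {}
    for key in sorted(first.keys() | second.keys()):
        only_first = sorted(first.get(key, set()) - second.get(key, set()))
        only_second = sorted(second.get(key, set()) - first.get(key, set()))
        if only_first or only_second:
            report[key] = {"only_in_first": only_first, "only_in_second": only_second}
    return report


first_file = "WRKY26\tWRKY26\tAT1G63650\tAT5G60890\n"
second_file = "WRKY26\tWRKY26\tAT1G63650\tAT3G55730\tAT5G60890\n"
print(compare_gmt(first_file, second_file))
```

Identifiers that were fused in one of the files would show up here as unknown genes, which makes that kind of formatting error easy to spot.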

TFLink_ALL_LS_human.zip unzipping error

In the Homo_sapiens_9606 folder, TFLink_ALL_LS_human.zip cannot be unzipped. The command
unzip TFLink_ALL_LS_human.zip
gives the following error message:

Archive: TFLink_ALL_LS_human.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of TFLink_ALL_LS_human.zip or
TFLink_ALL_LS_human.zip.zip, and cannot find TFLink_ALL_LS_human.zip.ZIP, period.
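An archive like this could be caught before upload with a quick integrity check. The sketch below uses Python's standard zipfile module; check_zip is a hypothetical helper name, and the returned messages are illustrative.

```python
import zipfile


def check_zip(path: str) -> str:
    """Return 'ok' for a readable archive, otherwise a short description of the problem."""
    # is_zipfile() looks for the end-of-central-directory record,
    # the structure that unzip reports as missing in this issue
    if not zipfile.is_zipfile(path):
        return "not a readable zip archive"
    with zipfile.ZipFile(path) as archive:
        bad_member = archive.testzip()  # name of the first corrupt member, or None
    if bad_member is not None:
        return f"corrupt member: {bad_member}"
    return "ok"
```

Calling check_zip("TFLink_ALL_LS_human.zip") in the Homo_sapiens_9606 folder would be the natural use. A truncated upload or download is one common cause of this error, although the issue does not state the cause.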

Delete empty GMT files

There are empty GMT files, e.g.:

  • Protein_domain_PFAM_Bacteroides_thetaiotaomicron_VPI_5482_EntrezID.gmt
  • Protein_domain_PFAM_Bacteroides_thetaiotaomicron_VPI_5482_LocusID.gmt
  • Genomic_location_Ensembl_Daphnia_pulex_5genes_EntrezID.gmt
  • Genomic_location_Ensembl_Daphnia_pulex_10genes_EntrezID.gmt
  • Genomic_location_Ensembl_Daphnia_pulex_20genes_EntrezID.gmt
  • GO_BP_Saccharomyces_cerevisiae_entrezID.gmt

Please check why these files are empty: is it because of a mapping error or missing IDs? If the IDs cannot be mapped, please delete the empty GMT files. If they can be mapped, please remap them and check the results.

Please update the 4th sheet of the "mulea Supplementary Table 1" accordingly
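To list all affected files rather than a few examples, the folder tree could be scanned. In the Python sketch below, a file counts as empty when it has no line other than blank lines and "#" comments; that definition and the function name find_empty_gmt_files are assumptions.

```python
from pathlib import Path
from typing import List, Union


def find_empty_gmt_files(root: Union[str, Path]) -> List[Path]:
    """Return all .gmt files under root that contain no gene-set entry."""
    empty: List[Path] = []
    for path in sorted(Path(root).rglob("*.gmt")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            # An entry is any non-blank line that is not a header comment
            has_entry = any(
                line.strip() and not line.startswith("#") for line in handle
            )
        if not has_entry:
            empty.append(path)
    return empty
```

Running find_empty_gmt_files("GMT_files") from the repository root would cover every species folder at once, and the result can be cross-checked against the supplementary table.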

KEGG script

@olbeimarton can you add the script to download KEGG data to the scripts_to_create_GMT_files/KEGG folder?
