bridgedb / datasources

Repository with the BridgeDb data source.

License: Creative Commons Zero v1.0 Universal

Python 100.00%

datasources's Introduction

datasources

Repository with the BridgeDb data source. The data is kept in a separate repository so that other tools no longer depend on updates of the BridgeDb Java library to use the information.

The following URLs can be used in downstream tools:

It includes interoperability layers with identifiers.org and Bioregistry.io.

Testing

There are tests for data integrity that can be run with the following commands in the shell:

$ pip install tox
$ tox

datasources's Issues

Minor: column names are not great

https://github.com/bridgedb/datasources/blob/main/organisms.tsv

The headers of the first two columns are not really correct. Better headers would be: genus, species

Can we change these or will that mess up downstream tools?

Miriam identifiers for Gramene sources in datasources.txt

In datasources.txt, Gramene Genes should have the Miriam identifier urn:miriam:gramene.gene.

Should we create new Miriam identifiers also for the following rows?

Gramene Arabidopsis
Gramene Rice
Gramene Maize
Rice Ensembl Gene

ensure alignment with Identifiers.org

update the namespace for HGNC and Ensembl

e.g. https://www.genenames.org/data/hgnc_data.php?hgnc_id= should be replaced with https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/

Add Bioregistry prefix column

It's great you're curating a registry here! I saw it's linking out to Miriam prefixes, so I figured it would be good to include Bioregistry ones as well. This way it would be easier to monitor any potentially novel contributions here and suck them back into the Bioregistry.

I can write a script that does this based on Miriam and Wikidata mappings to demonstrate.

Edit: while working on this, I noticed there were several invalid MIRIAM prefixes, which led to #23

figure out the state of EcoGene

http://ecogene.org/ seems to be offline

Many non https addresses in the tsv files

I browsed the new tsv files and saw that there are many non-https addresses in there. Probably quite a few of these can be replaced.
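A minimal sketch of how such addresses could be listed, assuming the data is a tab-separated file with a header row. The function name `find_http_urls` is hypothetical, and the two sample rows are trimmed to three columns using values quoted elsewhere in these issues.

```python
import csv
import io


def find_http_urls(tsv_text):
    """Return (line_number, column, value) for every field that starts with http://."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for line_number, row in enumerate(reader, start=2):  # the header is line 1
        for column, value in row.items():
            if value and value.startswith("http://"):
                hits.append((line_number, column, value))
    return hits


# Trimmed sample; a real run would read the repository's TSV file instead.
sample = (
    "datasource_name\twebsite_url\tlinkout_pattern\n"
    "Gramene Rice\thttp://www.gramene.org/\t"
    "http://www.gramene.org/Oryza_sativa/Gene/Summary?db=core;g=$id\n"
    "ChEMBL compound\thttp://www.ebi.ac.uk/chembl/\t"
    "https://www.ebi.ac.uk/chembl/compound/inspect/$id\n"
)

for hit in find_http_urls(sample):
    print(hit)
```

A listing like this only finds candidates. Each address would still need to be checked by hand or by a resolver test, since not every host serves the same content over https.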

Look into bioregistry prefix for OpenTargets

figure out new URL pattern of MatrixDB

Originally posted as #8

Incorrect linkout patterns for Gramene Rice and Rice Ensembl Gene

I noticed a linkout that fails for a GeneProduct in the Geranylgeranyldiphosphate biosynthesis II pathway WP2211 at WikiPathways. The GeneProduct has a datasource of "Gramene Rice" and an identifier of "LOC_Os04g56210", as highlighted in green here.

Based on datasources.txt, the linkout should be
http://www.gramene.org/Oryza_sativa/Gene/Summary?db=core;g=LOC_Os04g56210
but that link gives the following error:

  Database Error

  Could not connect to the core;g=LOC_Os04g56210 database.

  This view requires a gene, transcript or protein identifier in the URL. For example:

  http://ensembl.gramene.org/Oryza_sativa/Gene/Summary?g=OS05G0113900

The following URIs do work:

The URI from combining the linkout and sample identifier listed for Gramene Rice in datasources.txt also does not resolve:
http://www.gramene.org/Oryza_sativa/Gene/Summary?db=core;g=osa-MIR171a

Is this the URI describing osa-MIR171a?
http://archive.gramene.org/db/genes/search_gene?acc=GR:0100777

If so, the linkout for Gramene Rice should be http://archive.gramene.org/db/genes/search_gene?acc=$id and the sample identifier should be GR:0100777.

Maybe Rice Ensembl Gene is the datasource that should be used for LOC_Os04g56210 in WP2211, because the Rice Ensembl Gene sample identifier LOC_Os04g54800 resembles LOC_Os04g56210, unlike the sample identifier listed for Gramene Rice, osa-MIR171a. However, the linkout pattern from datasources.txt would indicate the linkout should be
http://www.gramene.org/Oryza_sativa/geneview?gene=LOC_Os04g54800
which gives this error:

Gene 'LOC_Os04g54800' not found

The identifier 'LOC_Os04g54800' is not present in the current release of the Ensembl Plants database.

This view requires a gene, transcript or protein identifier in the URL. For example:

http://ensembl.gramene.org/Oryza_sativa/Gene/Summary?g=OS05G0113900

Then the linkout pattern in datasources.txt for Rice Ensembl Gene would need to be updated to http://ensembl.gramene.org/Oryza_sativa/Gene/Summary?g=$id;db=otherfeatures

ensure alignment with Bioregistry.io

Unrecognised HGNC URI pattern

The pattern http://identifiers.org/hgnc/HGNC%253A29350 was found in a linkset but was not recognized by BridgeDb. However, it does redirect to the correct place, which suggests that identifiers.org does recognize it. So either the pattern needs to be added to BridgeDb or the linkset file needs to be changed.
The URI looks as if it has been escaped incorrectly in some way.
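One possible explanation, not confirmed in the issue, is that the identifier was percent-encoded twice: `%25` is the encoding of `%`, so `%253A` decodes to `%3A`, which in turn decodes to `:`. A short sketch using Python's standard library:

```python
from urllib.parse import quote, unquote

escaped = "HGNC%253A29350"  # identifier part of the URI from the linkset

decoded_once = unquote(escaped)        # one level of percent-decoding
decoded_twice = unquote(decoded_once)  # a second level

print(decoded_once)
print(decoded_twice)

# Encoding the plain identifier twice reproduces the pattern seen in the linkset.
encoded_once = quote("HGNC:29350", safe="")
encoded_twice = quote(encoded_once, safe="")
print(encoded_twice == escaped)
```

If that assumption holds, fixing the linkset so that the identifier is encoded only once (or not at all) would be the smaller change.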

example_identifier for SwissProt

datasources.txt lists CALM_HUMAN as an example_identifier for SwissProt. datasources_headers.txt defines example_identifier as

A valid example of a datasource identifier; may not be representative of all types of identifiers from a given resource (example: 1851_s_at)

I discussed this with Nick Juty, and the result was that CALM_HUMAN is an entry name:

The entry name is a useful mnemonic means of identifying a sequence, but, unlike the accession number, it is not a stable identifier.

Further comment from Nick:

I think that the gene name is CALM, and in humans it is CALM_HUMAN. But we use the identifier provided by UniProt for the record, not for the gene or protein. The identifier for the record is the stable identifier.

So is datasources.txt wrong to list CALM_HUMAN as an example identifier for SwissProt? Should the example_identifier instead be P62158?

Add taxonomy IDs to organism table

Can you add a new column to this table with the taxonomy IDs? This would be useful for a few known use cases. An additional column shouldn't mess up any current use cases.

https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.bio/resources/org/bridgedb/bio/organisms.txt

request for adding this species: Solanum tuberosum

See bridgedb/create-bridgedb-genedb-config#4

Unclear origin of the UCSC genome browser "uc\d{3}[a-z]{3}\.\d" type identifier

Following up from #2 and consulting the UCSC Genome Browser team ("I cannot find any gene with that identifier, so that may be the problem"), it is unclear what the correct fix is for the issue. The linkout points to the browser directly, but it takes either a gene name (like TP53) or a chromosomal position. See also http://genome.ucsc.edu/FAQ/FAQlink.html#genes

Agree on using "UniProtKB" as full name

Update datasources.tsv

There are quite a few outdated/deprecated example mappings in the datasources.txt file.
Several issues have been filed on that matter:

So, time to clean up that mess (perhaps make a "deprecateddatasources.txt" file, to keep track?).
@egonw suggested testing the linkouts automatically to see if they resolve.
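A rough sketch of what such an automatic check could look like, using only the Python standard library. The function names `build_linkout` and `resolves` are hypothetical, and the only thing taken from the data files is the `$id` placeholder in the linkout patterns.

```python
import urllib.request


def build_linkout(linkout_pattern, identifier):
    """Substitute the identifier for the $id placeholder in a linkout pattern."""
    return linkout_pattern.replace("$id", identifier)


def resolves(url, timeout=10):
    """Return True if the URL answers with an HTTP status below 400.

    Needs network access. HTTP errors, timeouts and connection failures
    are all subclasses of OSError and count as "does not resolve".
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        return False


url = build_linkout(
    "https://www.ebi.ac.uk/chembl/compound/inspect/$id", "CHEMBL308052"
)
print(url)
# resolves(url) performs a real request, so it is not called here.
```

A status check has limits: some servers reject HEAD requests, and a site can return an error page with a success status, as the Gramene "Database Error" page quoted in another issue may do. Results would therefore need a manual look before entries are marked as deprecated.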

Curate remaining entries to the Bioregistry

After lots of careful curation, there are only four resources listed in this repository that I can't quite figure out:

datasource_name	system_code	website_url	linkout_pattern	example_identifier	entity_identified	single_species	identifier_type	uri	regex	official_name	wikidata_property	bioregistry
Gramene Arabidopsis	EnAt	http://www.gramene.org/	http://www.gramene.org/Arabidopsis_thaliana/Gene/Summary?g=$id	ATMG01360-TAIR-G	gene	Arabidopsis thaliana	1	EnAt	AT[\dCM]G\d{5}-TAIR-G	Gramene Arabidopsis	nan	nan
Gramene Maize	EnZm	http://www.ensembl.org	http://www.maizesequence.org/Zea_mays/Gene/Summary?g=$id	GRMZM2G174107	gene	nan	1	EnZm	nan	Gramene Maize	nan	nan
Gramene Rice	EnOj	http://www.gramene.org/	http://www.gramene.org/Oryza_sativa/Gene/Summary?db=core;g=$id	osa-MIR171a	gene	nan	1	EnOj	nan	Gramene Rice	nan	nan
Rice Ensembl Gene	Os	http://www.gramene.org/Oryza_sativa	http://www.gramene.org/Oryza_sativa/geneview?gene=$id	LOC_Os04g54800	gene	Oryza sativa	1	Os	nan	Rice Ensembl Gene	nan	nan

Example URLs:

http://www.gramene.org/Arabidopsis_thaliana/Gene/Summary?g=ATMG01360-TAIR-G (works, but should just be ATMG01360)
http://www.maizesequence.org/Zea_mays/Gene/Summary?g=GRMZM2G174107 (redirects to https://ensembl.gramene.org/Zea_mays/Gene/Summary?g=GRMZM2G174107)
http://www.gramene.org/Oryza_sativa/Gene/Summary?db=core;g=osa-MIR171a (dead)
http://www.gramene.org/Oryza_sativa/geneview?gene=LOC_Os04g54800 (dead)

So the question for the first two is: what should we call these in Bioregistry? Should they really get their own prefixes, or is there a more general Gramene resolver for all of these IDs?

For the last two, can these be fixed? Maybe they just need a new example from the same pattern.
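When an example identifier is changed, it helps to check it against the `regex` column as well. The sketch below assumes the regex is meant to match the whole identifier, which the data files do not state; the helper name `example_matches` is hypothetical, and the rows are copied from entries quoted in these issues.

```python
import re

rows = [
    # (datasource_name, regex, example_identifier)
    ("Gramene Arabidopsis", r"AT[\dCM]G\d{5}-TAIR-G", "ATMG01360-TAIR-G"),
    ("ChEMBL compound", r"^CHEMBL\d+$", "CHEMBL308052"),
]


def example_matches(regex, example_identifier):
    """Return True if the whole identifier matches the regex."""
    return re.fullmatch(regex, example_identifier) is not None


for name, regex, example in rows:
    print(name, example_matches(regex, example))

# The shortened identifier suggested above no longer matches the current regex.
print(example_matches(r"AT[\dCM]G\d{5}-TAIR-G", "ATMG01360"))
```

Under that assumption, shortening the Gramene Arabidopsis example to ATMG01360 would also require updating its regex.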

WikiPathways linkouts

Tested the gene/product linkouts on WikiPathways:

Affymetrix - log in required - is that intended, shall we keep that?
Agilent, Illumina - not a link, just a number - is that intended?
Ensembl - patch bridgedb/BridgeDb#146 and bridgedb/BridgeDb#147
UCSC genome browser - does not work - action required
UniGene - does not work - action required
WikiGenes - connection timeout - check again and then decide

Update HGNC linkouts

See also #6; the HGNC linkouts on the WikiPathways website and in PathVisio don't work.

WikiPathways WP4868

Result:

PathVisio (WP4868):

Result:

Which system codes for ChEMBL?

In commit ab02add1bee33b47e45bfdee7f89190681e9bcf2 @egonw added to org.bridgedb.bio/resources/org/bridgedb/bio/datasources.txt:

ChEMBL compound Cl  http://www.ebi.ac.uk/chembl/    https://www.ebi.ac.uk/chembl/compound/inspect/$id   CHEMBL308052    metabolite      1   urn:miriam:chembl.compound  ^CHEMBL\d+$ ChEMBL compound

Using system code Cl here clashes with the equivalent entry in org.bridgedb.rdf, which uses ChEMBLCompound - what was the reason for going with Cl?

See both IdentifiersOrgDataSource.ttl and IdentifiersOrgDataSource.txt.

This (luckily) causes the IdentifersOrgReaderTest test to fail with:

Caused by: java.lang.IllegalArgumentException: System code does not match for DataSource ChEMBL compound was Cl so it can not be changed to ChEMBLCompound
    at org.bridgedb.DataSource.findOrRegister(DataSource.java:640)
    at org.bridgedb.DataSource.register(DataSource.java:620)
    at org.bridgedb.rdf.BridgeDBRdfHandler.readDataSource(BridgeDBRdfHandler.java:131)
    at org.bridgedb.rdf.BridgeDBRdfHandler.getDataSource(BridgeDBRdfHandler.java:121)
    at org.bridgedb.rdf.BridgeDBRdfHandler.readAllDataSources(BridgeDBRdfHandler.java:113)
    at org.bridgedb.rdf.BridgeDBRdfHandler.doParseRdfInputStream(BridgeDBRdfHandler.java:92)
    ... 33 more

The system codes used for ChEMBL within IdentifiersOrgDataSource.txt are not ideal:

ChEMBLCompound
ChemblId
ChemblMolecule
chembl.target
ChemblTarget (!)
Chembl16TargetComponent

These are very long, include a (wrong) version number, contain duplicates, and are inconsistent.

At Identifiers.org we find the names

(but nothing for molecules, assays or target component)

Cc is already used by CCDS.

After discussing this with @egonw I suggest modifying org/bridgedb/bio/datasources.txt to use system codes:

ChC (ChEMBL compound)
ChT (ChEMBL target)
ChTC (ChEMBL Target Component) -- or ChP for "protein"?

CamelCasing here mimics other entries like EnMm (Ensembl Mouse).

Views?
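A clash like the one with Cc can be caught before a code is chosen. The sketch below is illustrative only: `find_code_clashes` is a hypothetical helper, codes are compared as exact, case-sensitive strings (an assumption), and the lists contain only the codes mentioned in this issue rather than the full datasources.txt.

```python
from collections import defaultdict


def find_code_clashes(entries):
    """Map each system code used by more than one data source to those sources."""
    by_code = defaultdict(list)
    for name, code in entries:
        by_code[code].append(name)
    return {code: names for code, names in by_code.items() if len(names) > 1}


existing = [("CCDS", "Cc"), ("Ensembl Mouse", "EnMm")]
proposed = [
    ("ChEMBL compound", "ChC"),
    ("ChEMBL target", "ChT"),
    ("ChEMBL Target Component", "ChTC"),
]

# The proposed codes do not collide with the codes listed here.
print(find_code_clashes(existing + proposed))

# Reusing Cc for ChEMBL compound would collide with CCDS.
print(find_code_clashes(existing + [("ChEMBL compound", "Cc")]))
```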

Incorrect linkout patterns or example identifiers

The following linkout pattern + example identifier combinations appear to be outdated or incorrect:

Do not resolve:

Outdated:

update of VMH metabolite links

Hello,

Please update the links to the VMH database, e.g., https://www.vmh.life/#metabolite/glc_D

It still uses a vmh.uni.lu link.

Thanks Ines