datasources's Introduction

datasources

Repository with the BridgeDb data sources. They are abstracted out of the BridgeDb Java library so that other tools no longer depend on library updates to use this information.

The following URLs can be used in downstream tools:

It includes interoperability layers with identifiers.org and Bioregistry.io.

Testing

There are tests for data integrity that can be run with the following commands in the shell:

$ pip install tox
$ tox

datasources's People

Contributors

cthoyt, denisesl22, egonw, finterly, hbasaric, mkutmon, tabbassidaloii

datasources's Issues

Add Bioregistry prefix column

It's great you're curating a registry here! I saw it links out to MIRIAM prefixes, so I figured it would be good to include Bioregistry ones as well. That would make it easier to monitor any potentially novel contributions here and pull them back into the Bioregistry.

I can write a script that does this based on Miriam and Wikidata mappings to demonstrate.

Edit: while working on this, I noticed there were several invalid MIRIAM prefixes, which led to #23.

Incorrect linkout patterns for Gramene Rice and Rice Ensembl Gene

I noticed a linkout that fails for a GeneProduct in the Geranylgeranyldiphosphate biosynthesis II pathway WP2211 at WikiPathways. The GeneProduct has a datasource of "Gramene Rice" and an identifier of "LOC_Os04g56210", as highlighted in green here.

Based on datasources.txt, the linkout should be
http://www.gramene.org/Oryza_sativa/Gene/Summary?db=core;g=LOC_Os04g56210
but that link gives the following error:

  Database Error

  Could not connect to the core;g=LOC_Os04g56210 database.

  This view requires a gene, transcript or protein identifier in the URL. For example:

  http://ensembl.gramene.org/Oryza_sativa/Gene/Summary?g=OS05G0113900

The following URIs do work:

The URI from combining the linkout and sample identifier listed for Gramene Rice in datasources.txt also does not resolve:
http://www.gramene.org/Oryza_sativa/Gene/Summary?db=core;g=osa-MIR171a

Is this the URI describing osa-MIR171a?
http://archive.gramene.org/db/genes/search_gene?acc=GR:0100777

If so, the linkout for Gramene Rice should be http://archive.gramene.org/db/genes/search_gene?acc=$id and the sample identifier should be GR:0100777.

Maybe Rice Ensembl Gene is the datasource that should be used for LOC_Os04g54800 in WP2211, because the Rice Ensembl Gene sample identifier LOC_Os04g54800 resembles LOC_Os04g56210, unlike the sample identifier listed for Gramene Rice osa-MIR171a. However, the linkout pattern from datasources.txt would indicate the linkout should be
http://www.gramene.org/Oryza_sativa/geneview?gene=LOC_Os04g54800
which gives this error:

Gene 'LOC_Os04g54800' not found

The identifier 'LOC_Os04g54800' is not present in the current release of the Ensembl Plants database.

This view requires a gene, transcript or protein identifier in the URL. For example:

http://ensembl.gramene.org/Oryza_sativa/Gene/Summary?g=OS05G0113900

In that case, the linkout pattern in datasources.txt for Rice Ensembl Gene would need to be updated to http://ensembl.gramene.org/Oryza_sativa/Gene/Summary?g=$id;db=otherfeatures
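
The linkout patterns in this issue use `$id` as a placeholder for the identifier. Here is a minimal sketch of how a pattern expands into a URL, assuming plain string substitution. The helper name `build_linkout` is hypothetical and not part of BridgeDb. The sketch only builds the URLs; it does not check whether they resolve.

```python
def build_linkout(pattern: str, identifier: str) -> str:
    """Expand a linkout pattern by replacing the $id placeholder."""
    if "$id" not in pattern:
        raise ValueError(f"pattern has no $id placeholder: {pattern!r}")
    return pattern.replace("$id", identifier)


# Pattern currently listed for Rice Ensembl Gene, and the one proposed above.
current_pattern = "http://www.gramene.org/Oryza_sativa/geneview?gene=$id"
proposed_pattern = (
    "http://ensembl.gramene.org/Oryza_sativa/Gene/Summary?g=$id;db=otherfeatures"
)

current_url = build_linkout(current_pattern, "LOC_Os04g54800")
proposed_url = build_linkout(proposed_pattern, "LOC_Os04g54800")
print(current_url)
print(proposed_url)
```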

Unrecognised HGNC URI pattern

The pattern http://identifiers.org/hgnc/HGNC%253A29350 was found in a linkset but was not recognized by BridgeDB. However, it does redirect to the correct place, which suggests that identifiers.org does recognize it. So either the pattern needs to be added to BridgeDB or the linkset file needs to be changed.
The URI looks like it has been escaped incorrectly in some way.
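
One plausible explanation, not confirmed in this issue, is that the colon in `HGNC:29350` was percent-encoded twice: `:` becomes `%3A`, and encoding that again turns `%` into `%25`, giving `%253A`. A small sketch with Python's standard `urllib.parse` reproduces the string seen in the linkset:

```python
from urllib.parse import quote, unquote

identifier = "HGNC:29350"

# First pass: ':' is percent-encoded as %3A.
encoded_once = quote(identifier, safe="")
# Second pass: the '%' itself is encoded as %25, so %3A becomes %253A.
encoded_twice = quote(encoded_once, safe="")

print(encoded_once)
print(encoded_twice)

# Decoding twice recovers the original identifier.
decoded = unquote(unquote(encoded_twice))
print(decoded)
```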

example_identifier for SwissProt

datasources.txt lists CALM_HUMAN as an example_identifier for SwissProt. datasources_headers.txt defines example_identifier as

A valid example of a datasource identifier; may not be representative of all types of identifiers from a given resource (example: 1851_s_at)

I discussed this with Nick Juty, and the result was that CALM_HUMAN is an entry name:

The entry name is a useful mnemonic means of identifying a sequence, but, unlike the accession number, it is not a stable identifier.

Further comment from Nick:

I think that the gene name is CALM, and in humans it is CALM_HUMAN. But we use the identifier provided by UniProt for the record, not for the gene or protein. The identifier for the record is the stable identifier.

So is datasources.txt wrong to list CALM_HUMAN as an example identifier for SwissProt? Should the example_identifier instead be P62158?
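
The difference between an entry name and an accession can be checked mechanically. The sketch below assumes the accession number pattern given in the UniProt help pages; the constant and function names are illustrative only and do not come from datasources.txt.

```python
import re

# Accession pattern as documented by UniProt (assumed here, not taken from
# datasources.txt).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_accession(value: str) -> bool:
    """Return True if the whole string has the shape of a UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(value) is not None


for candidate in ("CALM_HUMAN", "P62158"):
    print(candidate, looks_like_accession(candidate))
```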

Curate remaining entries to the Bioregistry

After lots of careful curation, there are only four resources listed in this repository that I can't quite figure out:

datasource_name | system_code | website_url | linkout_pattern | example_identifier | entity_identified | single_species | identifier_type | uri | regex | official_name | wikidata_property | bioregistry
Gramene Arabidopsis | EnAt | http://www.gramene.org/ | http://www.gramene.org/Arabidopsis_thaliana/Gene/Summary?g=$id | ATMG01360-TAIR-G | gene | Arabidopsis thaliana | 1 | EnAt | AT[\dCM]G\d{5}-TAIR-G | Gramene Arabidopsis | nan | nan
Gramene Maize | EnZm | http://www.ensembl.org | http://www.maizesequence.org/Zea_mays/Gene/Summary?g=$id | GRMZM2G174107 | gene | nan | 1 | EnZm | nan | Gramene Maize | nan | nan
Gramene Rice | EnOj | http://www.gramene.org/ | http://www.gramene.org/Oryza_sativa/Gene/Summary?db=core;g=$id | osa-MIR171a | gene | nan | 1 | EnOj | nan | Gramene Rice | nan | nan
Rice Ensembl Gene | Os | http://www.gramene.org/Oryza_sativa | http://www.gramene.org/Oryza_sativa/geneview?gene=$id | LOC_Os04g54800 | gene | Oryza sativa | 1 | Os | nan | Rice Ensembl Gene | nan | nan

Example URLs:

So the question for the first two is: what should we call these in the Bioregistry? Should they really get their own prefixes, or is there a more general Gramene resolver for all of these IDs?

For the last two, can these be fixed? Maybe just need a new example from the same pattern.
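
When looking for a new example identifier, one cheap sanity check is whether the example matches the entry's regex, where one is given. This is a minimal sketch using two rows from the table above; the dictionary layout and the function name `check_example` are assumptions for illustration, not the repository's own test code.

```python
import re
from typing import Optional

# Two rows from the table above; a regex of None stands for the "nan" cells.
entries = [
    {
        "name": "Gramene Arabidopsis",
        "regex": r"AT[\dCM]G\d{5}-TAIR-G",
        "example": "ATMG01360-TAIR-G",
    },
    {"name": "Gramene Maize", "regex": None, "example": "GRMZM2G174107"},
]


def check_example(entry: dict) -> Optional[bool]:
    """Return True/False if a regex exists, or None when there is none to test."""
    if entry["regex"] is None:
        return None
    return re.fullmatch(entry["regex"], entry["example"]) is not None


results = {entry["name"]: check_example(entry) for entry in entries}
print(results)
```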

WikiPathways linkouts

I tested the gene/product linkouts on WikiPathways:

  • Affymetrix - log in required - is that intended, shall we keep that?
  • Agilent, Illumina - not a link, just a number - is that intended?
  • Ensembl - patch bridgedb/BridgeDb#146 and bridgedb/BridgeDb#147
  • UCSC genome browser - does not work - action required
  • UniGene - does not work - action required
  • WikiGenes - connection timeout - check again and then decide

Update HGNC linkouts

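See also #6. The HGNC linkouts on the WikiPathways website and in PathVisio don't work:

  • WikiPathways website: [screenshot of the linkout, followed by a screenshot of the result]
  • PathVisio (WP4868): [screenshot of the linkout, followed by a screenshot of the result]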

Which system codes for ChEMBL?

In commit ab02add1bee33b47e45bfdee7f89190681e9bcf2 @egonw added to org.bridgedb.bio/resources/org/bridgedb/bio/datasources.txt:

ChEMBL compound Cl  http://www.ebi.ac.uk/chembl/    https://www.ebi.ac.uk/chembl/compound/inspect/$id   CHEMBL308052    metabolite      1   urn:miriam:chembl.compound  ^CHEMBL\d+$ ChEMBL compound

Using system code Cl here clashes with the equivalent entry in org.bridgedb.rdf, which uses ChEMBLCompound - what was the reason for going with Cl?

See both IdentifiersOrgDataSource.ttl and IdentifiersOrgDataSource.txt.

This (luckily) causes the IdentifersOrgReaderTest test to fail with:

Caused by: java.lang.IllegalArgumentException: System code does not match for DataSource ChEMBL compound was Cl so it can not be changed to ChEMBLCompound
    at org.bridgedb.DataSource.findOrRegister(DataSource.java:640)
    at org.bridgedb.DataSource.register(DataSource.java:620)
    at org.bridgedb.rdf.BridgeDBRdfHandler.readDataSource(BridgeDBRdfHandler.java:131)
    at org.bridgedb.rdf.BridgeDBRdfHandler.getDataSource(BridgeDBRdfHandler.java:121)
    at org.bridgedb.rdf.BridgeDBRdfHandler.readAllDataSources(BridgeDBRdfHandler.java:113)
    at org.bridgedb.rdf.BridgeDBRdfHandler.doParseRdfInputStream(BridgeDBRdfHandler.java:92)
    ... 33 more

The system codes used for ChEMBL within IdentifiersOrgDataSource.txt are not ideal:

  • ChEMBLCompound
  • ChemblId
  • ChemblMolecule
  • chembl.target
  • ChemblTarget (!)
  • Chembl16TargetComponent

These are very long, include a (wrong) version number, contain duplicates, and are inconsistent.

At Identifiers.org we find the names

(but nothing for molecules, assays or target component)

Cc is already used by CCDS.

After discussing this with @egonw I suggest modifying org/bridgedb/bio/datasources.txt to use system codes:

  • ChC (ChEMBL compound)
  • ChT (ChEMBL target)
  • ChTC (ChEMBL Target Component) -- or ChP for "protein"?

CamelCasing here mimics other entries like EnMm (Ensembl Mouse).

Views?
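
A proposal like this can be checked for clashes before editing datasources.txt. The sketch below uses only the system codes mentioned in this issue as a tiny excerpt of the existing table; the function name `find_clashes` is hypothetical and not part of BridgeDb.

```python
# Excerpt of existing system codes, limited to those named in this issue.
existing = {
    "Cc": "CCDS",
    "Cl": "ChEMBL compound",
    "EnMm": "Ensembl Mouse",
}

# Codes proposed above.
proposed = {
    "ChC": "ChEMBL compound",
    "ChT": "ChEMBL target",
    "ChTC": "ChEMBL Target Component",
}


def find_clashes(existing: dict, proposed: dict) -> dict:
    """Map each proposed code that is already taken by a different data source
    to a (current owner, proposed owner) pair."""
    return {
        code: (existing[code], name)
        for code, name in proposed.items()
        if code in existing and existing[code] != name
    }


print(find_clashes(existing, proposed))
print(find_clashes(existing, {"Cc": "ChEMBL compound"}))
```

An empty result for the proposed codes only means they do not collide with this excerpt; the full datasources.txt would need to be loaded for a real check.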

Incorrect linkout patterns or example identifiers

The following linkout pattern + example identifier combinations appear to be outdated or incorrect:

Do not resolve:

Outdated:
