julielab-concept-db-manager's Introduction

JULIE Lab Concept Database Manager

A project that organizes the insertion of (ontological) concepts into a Neo4j graph database.

julielab-concept-db-manager's People

Contributors

khituras, pikatech

julielab-concept-db-manager's Issues

Fix NoHttpResponseException

This happens when exporting large files where each request takes several minutes to complete. The pooled HTTP connections then become stale, and the HttpComponents library does not seem to be able to handle that. The solution for now is to avoid reusing connections. A better solution would apparently be to switch to a different HTTP library, e.g. OkHttp.

Stream HTTP responses

Currently, responses from Neo4j are read into one large string before writing to disc. This causes crashes. Instead, stream the response and write it directly to disc so that it never has to be held in memory as a whole.
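A minimal sketch of the streaming approach, using only the JDK. The class and method names are hypothetical and not taken from the project. The response body is assumed to be available as an InputStream (with HttpComponents that would typically be what HttpEntity.getContent() returns).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper; not part of the project's code base. */
final class ResponseToDisk {

    /**
     * Copies the response body to the target file chunk by chunk, so the
     * whole response is never held in memory at once.
     * Returns the number of bytes written.
     */
    static long writeToFile(InputStream responseBody, Path target) throws IOException {
        try (InputStream in = responseBody) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real HTTP response body.
        InputStream body = new ByteArrayInputStream("id\tname\n".getBytes(StandardCharsets.UTF_8));
        Path target = Files.createTempFile("export", ".tsv");
        long written = writeToFile(body, target);
        System.out.println(written + " bytes written to " + target);
    }
}
```

Memory use then depends on the copy buffer rather than on the size of the export.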

Support javax.ws.rs.core.StreamingOutput

Our Neo4j plugins have been converted to unmanaged extensions. Some endpoints return a StreamingOutput, which can then be converted into a byte[] or a string. Let the JavaClassFileDBExporter support that.
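A rough sketch of the conversion, assuming the return type meant here is JAX-RS's javax.ws.rs.core.StreamingOutput, whose single method writes to an OutputStream. To stay free of dependencies, the code uses a stand-in interface of the same shape. All names are hypothetical and nothing here is taken from JavaClassFileDBExporter.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not part of the project's code base. */
final class StreamingBodies {

    /** Stand-in with the same single-method shape as StreamingOutput (assumption). */
    @FunctionalInterface
    interface StreamingBody {
        void write(OutputStream output) throws IOException;
    }

    /** Lets the body write itself into an in-memory buffer and returns the bytes. */
    static byte[] toBytes(StreamingBody body) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        body.write(buffer);
        return buffer.toByteArray();
    }

    /** Same as toBytes, decoded as UTF-8 (the encoding is an assumption). */
    static String toUtf8String(StreamingBody body) throws IOException {
        return new String(toBytes(body), StandardCharsets.UTF_8);
    }
}
```

For large exports it would be preferable to hand the body a file output stream instead of a ByteArrayOutputStream, so the result is not buffered in memory.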

Fix tests of julielab-concept-db-manager-core

===============================================
Default suite
Total tests run: 8, Failures: 3, Skips: 0
Configuration Failures: 1, Skips: 0
===============================================

FAILED CONFIGURATION: @AfterClass setupTest
java.nio.file.DirectoryNotEmptyException: src/test/resources/graph.db

FAILED: testInsertion
de.julielab.concepts.util.ConceptInsertionException: org.apache.commons.configuration2.ex.ConfigurationException: The following required parameters are not set in the configuration:
configuration.pluginname
configuration.pluginendpoint

FAILED: testConfiguration
java.lang.AssertionError: 
Expecting:
 <[null, "ID_MAP_NCBI_GENES", "FACET", "BIO_PORTAL"]>
to contain:
 <["/db/data/ext/Export/graphdb/hypernyms", "ID_MAP_NCBI_GENES"]>
but could not find:
 <["/db/data/ext/Export/graphdb/hypernyms"]>

FAILED: testAggregation
de.julielab.concepts.util.ConceptInsertionException: de.julielab.concepts.util.InternalNeo4jException: Errors: No such ServerPlugin: "null"

Stream concepts to server

We already had memory issues with large concept import processes. As a preliminary fix, we send the concepts in batches. An even better solution would be streaming: write the concepts into a stream that is directly connected to the database server, where the stream is read.
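The client side of such a stream could look roughly like the sketch below. It assumes the concepts are already serialized to strings and sent one per line. Both the wire format and all names are assumptions made for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

/** Hypothetical helper; not part of the project's code base. */
final class ConceptStreamWriter {

    /**
     * Writes one serialized concept per line as soon as the iterator yields it.
     * Apart from the writer's small buffer, only the current concept is in memory.
     * Returns the number of concepts written.
     */
    static int writeConcepts(Iterator<String> serializedConcepts, OutputStream out) throws IOException {
        Writer writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        int count = 0;
        while (serializedConcepts.hasNext()) {
            writer.write(serializedConcepts.next());
            writer.write('\n');
            count++;
        }
        // Flush but do not close: the caller owns the stream.
        writer.flush();
        return count;
    }
}
```

If the target stream comes from java.net.HttpURLConnection, calling setChunkedStreamingMode(0) before getOutputStream() keeps the JDK from buffering the whole request body. The server side would need to read the request incrementally as well.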

Add GO annotations

Gene Ontology terms are assigned to genes to describe their function. Integrate GO as nodes, e.g. by previously importing the Gene Ontology via the BioPortalConceptCreator, and connect them to the genes they annotate.

Add UniProt mappings.

Using only the UniProt idmapping, create the UniProt items as nodes with a relationship to the genes they are mapped to.
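A sketch of how the gene mappings could be pulled out of the idmapping data. It assumes the three-column, tab-separated idmapping.dat layout (UniProt accession, ID type, ID) and that NCBI Gene IDs carry the ID type "GeneID". Class and method names are hypothetical.

```java
import java.util.Map;
import java.util.stream.Stream;

/** Hypothetical helper; not part of the project's code base. */
final class UniProtGeneMappings {

    /**
     * Keeps only the rows that map a UniProt accession to an NCBI Gene ID.
     * Each result entry is (UniProt accession, NCBI Gene ID).
     * Assumed row layout: accession TAB idType TAB id.
     */
    static Stream<Map.Entry<String, String>> geneMappings(Stream<String> lines) {
        return lines
                .map(line -> line.split("\t", -1))
                .filter(fields -> fields.length == 3 && fields[1].equals("GeneID"))
                .map(fields -> Map.entry(fields[0], fields[2]));
    }
}
```

Each entry would then become one UniProt node plus a relationship to the gene node with the given ID. Feeding the method from Files.lines keeps the reading lazy.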

Let NCBI Gene Concept Creator stream from disc

Right now, the NCBI Gene Concept Creator just reads all concepts, which requires 27 GB of memory, and then sends them all to the server. Let it return a real stream instead, from which the concepts are read step by step. This allows the import to begin immediately instead of first waiting for everything to be read, and it saves a lot of memory.
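A minimal sketch of a lazily evaluated stream over gene_info. It assumes the tab-separated layout in which the first three columns are tax_id, GeneID and Symbol and the header line starts with "#". The Gene class is a hypothetical stand-in for the project's concept type.

```java
import java.io.BufferedReader;
import java.util.stream.Stream;

/** Hypothetical helper; not part of the project's code base. */
final class GeneInfoReader {

    /** Hypothetical stand-in for the project's concept class. */
    static final class Gene {
        final String taxId;
        final String geneId;
        final String symbol;

        Gene(String taxId, String geneId, String symbol) {
            this.taxId = taxId;
            this.geneId = geneId;
            this.symbol = symbol;
        }
    }

    /**
     * Returns a lazy stream: a line is read and parsed only when the
     * consumer asks for the next element. The caller closes the reader.
     */
    static Stream<Gene> read(BufferedReader reader) {
        return reader.lines()
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                .map(line -> line.split("\t", -1))
                .filter(fields -> fields.length >= 3)
                .map(fields -> new Gene(fields[0], fields[1], fields[2]));
    }
}
```

With a reader from Files.newBufferedReader opened in a try-with-resources block, the consumer can start sending concepts after the first line has been read, and memory use no longer grows with the file size.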

Try to unify import/operate/export configuration XML schemas

Right now, the import element may have a subelement "serverplugininserter", while operate and export just have an "operator" or "exporter" element that holds the Java class name of the handling class. All other operator- or exporter-specific settings (e.g. plugin endpoint, method etc., which for imports go into the special subelement) are direct children of the "export" element. This is confusing. Try to resolve this.
Points to consider:

  • The DatabaseOperationService and DataExportService classes must still be able to find the correct class
  • The XML schema should still help with the configuration process

NCBI Gene Concept creator: Remove homologene.

Homologene hasn't been updated since 2014 and isn't going to be, according to an email I received from the NCBI help desk on November 26th, 2018, saying:

I am not aware of a search that would allow you to specifically search for paralogs. The only way to tell would be to look at a cluster. If it contains more than one gene from the same organism, you could perhaps consider these as paralogs. Proceed with caution as the process for Homologene is completely automated with no manual curation. Also, the last build was in 2014:

https://www.ncbi.nlm.nih.gov/homologene/statistics/

There are currently no plans for a new release.

and

Q: Since Homologene is no longer updated and thus perhaps even outdated (?), would you recommend to only use gene_groups/gene_ortholog for new tools?
A: Yes, I do suggest that (personal suggestion- I am not aware of any NCBI-wide guidelines on this topic).

Keep the top-homology code in place for future use, even though it won't be used right now.

Add dbXref IDs from gene_info.

A number of ID mappings from other gene databases are integrated into gene_info itself. Use this to create nodes for those other genes and connect them via relationships to the NCBI Gene node they belong to.
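A sketch of parsing the dbXrefs column, assuming its usual layout: entries separated by "|", each of the form database:identifier, with "-" marking an empty value. Identifiers may contain colons themselves, so only the first colon is treated as the separator. Class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of the project's code base. */
final class DbXrefs {

    /**
     * Splits a dbXrefs value into {database, identifier} pairs.
     * Returns an empty list for null, empty or "-" values.
     */
    static List<String[]> parse(String dbXrefs) {
        List<String[]> result = new ArrayList<>();
        if (dbXrefs == null || dbXrefs.isEmpty() || dbXrefs.equals("-")) {
            return result;
        }
        for (String xref : dbXrefs.split("\\|")) {
            int colon = xref.indexOf(':');
            // Skip malformed entries without a database prefix or identifier.
            if (colon <= 0 || colon == xref.length() - 1) {
                continue;
            }
            result.add(new String[] {xref.substring(0, colon), xref.substring(colon + 1)});
        }
        return result;
    }
}
```

Each pair would become a node for the external database entry, connected to the NCBI Gene node of the row it came from.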
