julielab / julielab-concept-db-manager


A project that organizes the insertion of (ontological) concepts into a Neo4j graph database.

Java 99.61% Shell 0.37% Dockerfile 0.02%

julielab-concept-db-manager's Introduction

JULIE Lab Concept Database Manager

julielab-concept-db-manager's Issues

Fix NoHttpResponseException

This happens when exporting large files, where each request takes several minutes to complete. The pooled HTTP connections then become stale, and the HttpComponents library does not seem to be able to handle that. The solution for now is to avoid reusing connections. A better solution would apparently be to switch the HTTP library, e.g. to OkHttp.
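The idea of avoiding connection reuse can be illustrated with the JDK's own `HttpURLConnection` rather than HttpComponents. This is only a sketch of the technique, not the project's code: the class and method names are hypothetical, and the URL in the usage note assumes a local Neo4j server on its default HTTP port.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of julielab-concept-db-manager.
final class NoReuseHttp {
    private NoReuseHttp() {}

    /** Prepares (but does not yet open) an HTTP connection that opts out of keep-alive. */
    static HttpURLConnection openWithoutReuse(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            // "Connection: close" asks the server to close the socket after this response,
            // so a connection that went stale during a long export is not reused later.
            conn.setRequestProperty("Connection", "close");
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, `NoReuseHttp.openWithoutReuse("http://localhost:7474/db/data/")` returns a connection whose request carries the `Connection: close` header; the actual network connection is only made once the response is read.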

Stream HTTP responses

Currently, responses from Neo4j are read into one large string before being written to disk. This causes crashes. Instead, stream the response and write it directly to disk without holding everything in memory.
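A minimal sketch of the streaming approach, assuming the response body is available as an `InputStream`. The class and method names are hypothetical. `Files.copy` moves the data in small chunks, so memory use stays flat regardless of response size.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of julielab-concept-db-manager.
final class ResponseStreaming {
    private ResponseStreaming() {}

    /** Copies the response body to the target file and returns the number of bytes written. */
    static long streamToFile(InputStream responseBody, Path target) {
        // try-with-resources closes the response stream even if the copy fails.
        try (InputStream in = responseBody) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```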

Support javax.ws.rs.core.StreamingOutput

Our Neo4j plugins have been converted to unmanaged extensions. Some endpoints return a StreamingOutput, which can then be converted into a byte[] or a string. Let the JavaClassFileDBExporter support that.
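The conversion into a byte[] or string could look roughly like the sketch below. To keep it compilable without JAX-RS on the classpath, it uses a stand-in interface with the same shape as the `write(OutputStream)` method of `javax.ws.rs.core.StreamingOutput`. All names here are hypothetical, and UTF-8 is an assumed encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of julielab-concept-db-manager.
final class StreamingBodies {
    private StreamingBodies() {}

    /** Stand-in mirroring the write(OutputStream) method of the JAX-RS streaming type. */
    interface StreamingBody {
        void write(OutputStream output) throws IOException;
    }

    /** Lets the body write itself into an in-memory buffer and returns the bytes. */
    static byte[] toBytes(StreamingBody body) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            body.write(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Same as toBytes, decoded as UTF-8 (an assumption about the endpoint's encoding). */
    static String toUtf8String(StreamingBody body) {
        return new String(toBytes(body), StandardCharsets.UTF_8);
    }
}
```

Buffering in memory is only suitable for small results; for large exports, the body would be written straight to a file stream instead.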

Fix tests of julielab-concept-db-manager-core

===============================================
Default suite
Total tests run: 8, Failures: 3, Skips: 0
Configuration Failures: 1, Skips: 0
===============================================

FAILED CONFIGURATION: @AfterClass setupTest
java.nio.file.DirectoryNotEmptyException: src/test/resources/graph.db

FAILED: testInsertion
de.julielab.concepts.util.ConceptInsertionException: org.apache.commons.configuration2.ex.ConfigurationException: The following required parameters are not set in the configuration:
configuration.pluginname
configuration.pluginendpoint

FAILED: testConfiguration
java.lang.AssertionError: 
Expecting:
 <[null, "ID_MAP_NCBI_GENES", "FACET", "BIO_PORTAL"]>
to contain:
 <["/db/data/ext/Export/graphdb/hypernyms", "ID_MAP_NCBI_GENES"]>
but could not find:
 <["/db/data/ext/Export/graphdb/hypernyms"]>

FAILED: testAggregation
de.julielab.concepts.util.ConceptInsertionException: de.julielab.concepts.util.InternalNeo4jException: Errors: No such ServerPlugin: "null"
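Regarding the configuration failure: `java.nio.file.DirectoryNotEmptyException` is what `Files.delete` throws for a directory that still has entries. Assuming the `@AfterClass` method tries to remove `src/test/resources/graph.db` with a single delete call, a recursive cleanup along these lines might help. The class name is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical test utility, not part of julielab-concept-db-manager.
final class TestDirs {
    private TestDirs() {}

    /** Deletes a directory tree; does nothing if the root does not exist. */
    static void deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order visits children before their parents,
            // so every directory is already empty when it is deleted.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```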

Add concept import for HGNC gene groups.

Stream concepts to server

We already had memory issues with large concept import processes. We solved this preliminarily by batching. An even better solution would be streaming: Write the concepts into a stream that is directly connected to the database server where the stream is read.
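A rough sketch of the streaming idea, under two assumptions that are not from the project: each concept has already been serialized to a single-line JSON string, and the server reads one concept per line. The class and method names are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

// Hypothetical helper, not part of julielab-concept-db-manager.
final class ConceptStreaming {
    private ConceptStreaming() {}

    /**
     * Writes concepts one per line to the stream connected to the server.
     * Only the concept currently being written is held in memory.
     * The caller owns the stream and is responsible for closing it.
     */
    static long writeConcepts(Iterator<String> serializedConcepts, OutputStream serverStream) {
        long written = 0;
        try {
            BufferedWriter writer =
                    new BufferedWriter(new OutputStreamWriter(serverStream, StandardCharsets.UTF_8));
            while (serializedConcepts.hasNext()) {
                writer.write(serializedConcepts.next());
                writer.write('\n');
                written++;
            }
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }
}
```

With `HttpURLConnection`, the request body stream can be kept from buffering the whole payload by calling `setChunkedStreamingMode` before `getOutputStream()`.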

Add GO annotations

Gene Ontology terms are assigned to genes to describe their function. Integrate GO terms as nodes, e.g. by first importing the Gene Ontology via the BioPortalConceptCreator, and connect them to the genes they annotate.

Add UniProt mappings.

Using only the UniProt idmapping, create the UniProt items as nodes with a relationship to the genes they are mapped to.
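As an illustration of the mapping step, the sketch below extracts accession-to-gene pairs from idmapping lines. It assumes the three-column, tab-separated layout (UniProtKB accession, ID type, ID) with `GeneID` as the ID type for NCBI Gene entries. The class and method names are hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical helper, not part of julielab-concept-db-manager.
final class UniProtIdMapping {
    private UniProtIdMapping() {}

    /**
     * Lazily maps idmapping lines to (UniProt accession, NCBI Gene ID) pairs.
     * Lines of other ID types or with an unexpected column count are skipped.
     */
    static Stream<Map.Entry<String, String>> geneMappings(Stream<String> lines) {
        return lines
                .map(line -> line.split("\t", -1))
                .filter(cols -> cols.length == 3 && "GeneID".equals(cols[1]))
                .<Map.Entry<String, String>>map(cols -> new SimpleImmutableEntry<>(cols[0], cols[2]));
    }
}
```

Each resulting pair would correspond to one UniProt node and one relationship to the gene node with that ID.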

Let NCBI Gene Concept Creator stream from disk

Right now, the NCBI Gene Concept Creator just reads all concepts, requiring 27 GB of memory, and then sends them all to the server. Instead, let it return a real stream from which the concepts are read step by step. This allows the import to begin immediately instead of first waiting until everything has been read, and it saves a lot of memory.
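A sketch of such a lazy stream, assuming an uncompressed, tab-separated gene_info file whose first three columns are tax ID, gene ID and symbol, and whose header line starts with `#`. The `Gene` class and all other names are hypothetical simplifications, not the project's concept types.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical reader, not part of julielab-concept-db-manager.
final class GeneInfoReader {
    private GeneInfoReader() {}

    /** Minimal stand-in for a gene concept. */
    static final class Gene {
        final String taxId;
        final String geneId;
        final String symbol;

        Gene(String taxId, String geneId, String symbol) {
            this.taxId = taxId;
            this.geneId = geneId;
            this.symbol = symbol;
        }
    }

    /**
     * Returns a lazy stream of genes: lines are read only as the stream is consumed.
     * The caller must close the stream to release the file handle.
     */
    static Stream<Gene> streamGenes(Path geneInfo) {
        try {
            return Files.lines(geneInfo, StandardCharsets.UTF_8)
                    .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                    .map(line -> line.split("\t", -1))
                    .filter(cols -> cols.length >= 3)
                    .map(cols -> new Gene(cols[0], cols[1], cols[2]));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The stream is best consumed in a try-with-resources block, for example `try (Stream<GeneInfoReader.Gene> genes = GeneInfoReader.streamGenes(path)) { ... }`.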

Fix CypherFileDBExporter issues

The exporter was not included in the list of services.
The expected configuration format was an older one.

Add possibility to set/override ImportOptions.

For flexible option changes.

Bump dependency versions

jackson XML
guava
commons-configuration2/beanutils
neo4j
and more ;-)

Try to unify import/operate/export configuration XML schemas

Right now, the import element may have a subelement "serverplugininserter", while operate and export just have an "operator" or "exporter" element that holds the Java class name of the handling class. All other operator- or exporter-specific settings (e.g. plugin endpoint, method, etc., which go into the dedicated subelement for imports) are direct children of the "export" element. This is confusing. Try to resolve this.
Points to consider:

The DatabaseOperationService and DataExportService classes must still be able to find the correct class
The XML schema should still be helping in the configuration process

Support unmanaged server extensions

Since server plugins have been removed in Neo4j 4.x, support unmanaged server extensions.

Add the total number of concepts to `ImportConcepts`

This should be an optional specification. But it could be useful for progress reports.
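One way the optional total could feed a progress report is sketched below. The class is hypothetical and says nothing about the actual `ImportConcepts` API. It only shows that the report degrades gracefully when no total is given.

```java
import java.util.OptionalLong;

// Hypothetical progress tracker, not part of julielab-concept-db-manager.
final class ImportProgress {
    private final OptionalLong total;
    private long done;

    ImportProgress(OptionalLong total) {
        this.total = total;
    }

    /** Records newly imported concepts and returns a progress message. */
    String advance(long count) {
        done += count;
        if (total.isPresent() && total.getAsLong() > 0) {
            // Cap at 100% in case the declared total was an underestimate.
            long percent = Math.min(100, done * 100 / total.getAsLong());
            return done + "/" + total.getAsLong() + " concepts (" + percent + "%)";
        }
        // Without a known total, only the running count can be reported.
        return done + " concepts";
    }
}
```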

NCBI Gene Concept creator: Remove homologene.

Homologene hasn't been updated since 2014 and isn't going to be, according to an email I received from the NCBI help desk on November 26th, 2018, saying:

I am not aware of a search that would allow you to specifically search for paralogs. The only way to tell would be to look at a cluster. If it contains more than one gene from the same organism, you could perhaps consider these as paralogs. Proceed with caution as the process for Homologene is completely automated with no manual curation. Also, the last build was in 2014:

https://www.ncbi.nlm.nih.gov/homologene/statistics/

There are currently no plans for a new release.

and

Q: Since Homologene is no longer updated and thus perhaps even outdated (?), would you recommend to only use gene_groups/gene_ortholog for new tools?
A: Yes, I do suggest that (personal suggestion- I am not aware of any NCBI-wide guidelines on this topic).

Keep the top-homology code in place for the future, but it won't be used for now.