
julielab / costosys


The Corpus Storage System (CoStoSys) is a tool and abstraction layer for a PostgreSQL document database and part of the Jena Document Information System (JeDIS).

License: BSD 2-Clause "Simplified" License

Java 99.90% Shell 0.10%
annotation database medline

Contributors: dependabot[bot], khituras


costosys's Issues

Mention database configuration file parameter

Neither the README nor the docbook mentions how to set the database configuration file (via the -dbc parameter) or that it defaults to dbcConfiguration.xml.

This should be part of the 'Quickstart' section of the doc.

Delete reference to removed mirror subset

Please add the following entry to the documentation:

If you removed the mirror subset of a table via a database statement (DROP TABLE name), you must also remove the corresponding entry from the _mirrorsubsets table in the _data schema. This is not done automatically. If you then try to create another mirror subset with the same name as the deleted one, CoStoSys will fail with an error message like this:

Error executing SQL command: INSERT INTO _data._mirrorSubsets VALUES ('_data.documents','user.medline',true)
org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "_data_mirrorsubsets_pkey"
  Detail: Key (subsettablename)=(user.medline) already exists.

The strings 'user' and 'medline' in the message stand for the names of your schema and table.
The problem is resolved by issuing a DELETE query against the _mirrorsubsets table. For the example above, the statement would be:
DELETE FROM _data._mirrorsubsets WHERE subsettablename = 'user.medline';

Add "Pubmed update" module

Currently, the code that updates PubMed and keeps track of the updates already applied lives outside of CoStoSys. Create an add-on, plugin or similar extension so that the update functionality of CoStoSys can do this itself.

Make unshared connections available

Currently, all CoStoSys connections can be shared via obtainOrReserveConnection. However, this can cause issues when two pieces of code use the same connection but one needs auto-commit and the other does not.
Thus, it should be possible to obtain a connection that is not shared with other code.

Add "drop table" functionality

Allow the CoStoSys CLI to drop tables. This is especially important for mirror tables, whose entries should then be removed automatically from the _mirrorsubsets table. See #2.

Allow update files to be processed again.

Add an -iap, --ignore-already-processed switch to the -im (import Medline) mode. Specifying it should cause update files to be applied, both for the database import and by deleters, regardless of whether they were already imported.
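
The effect of the proposed switch can be pictured as a simple filter over the list of update files. The following is a minimal sketch, not CoStoSys code: the class name, the method selectFiles and the file names are hypothetical and only illustrate the intended behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the --ignore-already-processed behavior.
class UpdateFileSelection {

    /**
     * Returns the update files to apply, preserving the given order.
     * Files recorded as already processed are skipped unless
     * ignoreAlreadyProcessed is set.
     */
    static List<String> selectFiles(List<String> candidates,
                                    Set<String> alreadyProcessed,
                                    boolean ignoreAlreadyProcessed) {
        List<String> selected = new ArrayList<>();
        for (String file : candidates) {
            if (ignoreAlreadyProcessed || !alreadyProcessed.contains(file)) {
                selected.add(file);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // File names are made up for the example.
        List<String> files = List.of("update-0001.xml.gz", "update-0002.xml.gz");
        Set<String> done = Set.of("update-0001.xml.gz");

        // Default behavior: only files not yet processed are applied.
        System.out.println(selectFiles(files, done, false));
        // With the switch: every file is applied again.
        System.out.println(selectFiles(files, done, true));
    }
}
```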

Add 'nsAware' attribute to 'tableSchema'

Until now, XML namespace awareness was always switched on.
The new nsAware attribute of the tableSchema element allows enabling or disabling namespace awareness for the underlying VTD-XML library.
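
To illustrate what namespace awareness changes in general, here is a small sketch that uses the JDK's built-in DOM parser as a stand-in. It does not use VTD-XML and is not how CoStoSys reads the nsAware attribute; the class name, the helper parseRoot and the sample XML are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Stand-in demonstration with the JDK DOM parser (not VTD-XML).
class NamespaceAwarenessSketch {

    /** Parses the XML string and returns its root element. */
    static Element parseRoot(String xml, boolean nsAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // This flag plays the role that nsAware plays in the table schema.
            factory.setNamespaceAware(nsAware);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<m:Article xmlns:m=\"urn:example\"/>";

        // Namespace-aware: prefix and local name are resolved separately.
        Element aware = parseRoot(xml, true);
        System.out.println(aware.getLocalName() + " in " + aware.getNamespaceURI());

        // Not namespace-aware: the element is only known by its qualified name.
        Element unaware = parseRoot(xml, false);
        System.out.println(unaware.getNodeName() + " in " + unaware.getNamespaceURI());
    }
}
```

With namespace awareness switched off, element names have to be matched including their prefix, which matters for any path expressions defined against the documents.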

Implement JeDIS binary format decoding for CLI queries

The binary format is more efficient in storage but cannot be queried as easily as text or XML storage.

To decode the data, we need to:

  • indicate in the table schema that it stores binary data
  • provide a facility to specify UIMA type descriptors which are required by the decoding algorithm

Restructure Medline import/update

Historically, we distinguished between "import" and "update": an import does not check for duplicates but just issues a series of INSERT statements. At some point, however, document IDs occurred multiple times, not only in update files but even in the baseline files. From then on, we used "updates" all the time.

The next step is to drop the distinction between baseline and updates altogether. All that matters is order: first the baseline files, then the update files in ascending order by file name.

Thus, we will define an ordered list of directories in the Medline/Pubmed update XML configuration file, and all files in them will be treated as update files. They will be written into the database as updates and marked as imported.

When a new Medline baseline is released, we only need to point to the new files, and the update will start from scratch, overwriting old documents with the same PMID.
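
The proposed processing order and overwrite semantics can be sketched as follows. This is a minimal illustration under stated assumptions, not the CoStoSys implementation: the class, the methods processingOrder and applyInOrder, the directory names and the file names are all hypothetical, and an in-memory map stands in for the database table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "order is all that matters" for Medline updates.
class MedlineUpdateOrder {

    /**
     * Directories keep their configured order (hence the LinkedHashMap);
     * files inside each directory are sorted ascending by file name.
     */
    static List<String> processingOrder(LinkedHashMap<String, List<String>> filesByDirectory) {
        List<String> order = new ArrayList<>();
        for (Map.Entry<String, List<String>> dir : filesByDirectory.entrySet()) {
            List<String> sorted = new ArrayList<>(dir.getValue());
            Collections.sort(sorted);
            for (String file : sorted) {
                order.add(dir.getKey() + "/" + file);
            }
        }
        return order;
    }

    /**
     * Applies (PMID, document) pairs in order. A later document with the
     * same PMID overwrites the earlier one, as an update would.
     */
    static Map<String, String> applyInOrder(List<Map.Entry<String, String>> documents) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : documents) {
            table.put(doc.getKey(), doc.getValue());
        }
        return table;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, List<String>> dirs = new LinkedHashMap<>();
        dirs.put("baseline", List.of("file-0002.xml.gz", "file-0001.xml.gz"));
        dirs.put("updatefiles", List.of("file-0004.xml.gz", "file-0003.xml.gz"));
        System.out.println(processingOrder(dirs));

        System.out.println(applyInOrder(List.of(
                Map.entry("12345", "baseline version"),
                Map.entry("12345", "updated version"))));
    }
}
```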
