o2r-finder's Issues

Evaluate golang

Only index o2r metadata in Elasticsearch

Reasons:

  • raw metadata is not "checked"
  • Zenodo metadata may only contain what is already in the o2r metadata (nothing is missing)
  • full text should still be indexed

@7048730 Can you confirm this? Do you see any reason to index more than just the o2r metadata, i.e. should the raw metadata be "searchable"?
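
A minimal sketch of what this restriction could look like in a user-defined transform function applied to each compendium before it goes to Elasticsearch (the function and field names follow the compendium documents shown further down in this list and are assumptions, not the actual finder code):

// Hypothetical transform: keep the checked o2r metadata block and the
// extracted file texts, drop metadata.raw entirely.
function transformCompendium(doc) {
  return {
    id: doc.id,
    compendium_id: doc.compendium_id,
    metadata: { o2r: doc.metadata ? doc.metadata.o2r : undefined },
    texts: doc.texts // full text content stays searchable
  };
}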

Implement syncing to Elasticsearch

Provide a list of similar compendia

Given a compendium id, the finder can provide a list of similar compendia based on upcoming research.
This list could be integrated into a UI, for example:

  • ordering of compendia for a substitution
  • show similar ERCs in the detail view of one ERC
  • show "near" ERCs when looking at the spatial data of one ERC

The API could take an identifier and return a list of identifiers and scores, or just proxy through the Elasticsearch response (i.e. include the actual metadata).

GET /api/v1/explore/Xid4U or GET /api/v1/compendium/Xid4U/similar

{ "results": [
  { "id": "12345", "score": 8.342 },
  { "id": "abcdf", "score": 5.42 },
  { "id": "12ab1", "score": 1.42 },
  { "id": "qwert", "score": 1.00 },
  { "id": "asdfg", "score": 0.1742 }
] }

Alternatively the API endpoint could re-use /search, but only if the response structure is the same, i.e. GET /api/v1/search?similar=Xid4U.

A verbose response could provide access to the sub-scores:

GET /api/v1/explore/Xid4U?verbose
{ 
  "results": [
     {"id": "aiejf", "score": 0.8813, "spatial": 0.2412, "code": 0.991, "text": 0.5 }, 
     {"id": "izxye", "score": 0.6713, ...}, 
     {...}
 ]
}

Using a further parameter, only one or more specific sub-scores are taken into account (and only the ones actually used appear in the response):

/api/v1/explore/Xid4U?component=code
/api/v1/explore/Xid4U?component=text,data
/api/v1/explore/Xid4U?component=text,data,spatial,temporal
/api/v1/explore/Xid4U?component=all (default)

component implies verbose.
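
The non-verbose variant could be backed by Elasticsearch's more_like_this query. A minimal sketch, assuming the old elasticsearch JS client used elsewhere in this service; the route, index name, and compared fields are assumptions, not decisions (per-component sub-scores would need separate queries):

const express = require('express');
const elasticsearch = require('elasticsearch');

const app = express();
const client = new elasticsearch.Client({ host: 'elasticsearch:9200' });

// GET /api/v1/compendium/:id/similar (one of the two route options above)
app.get('/api/v1/compendium/:id/similar', (req, res) => {
  client.search({
    index: 'o2r',
    body: {
      query: {
        more_like_this: {
          // fields to compare are an assumption
          fields: ['metadata.o2r.title', 'metadata.o2r.description', 'texts.content'],
          like: [{ _index: 'o2r', _id: req.params.id }],
          min_term_freq: 1
        }
      }
    }
  }).then(resp => {
    // reduce the Elasticsearch response to the id/score list proposed above
    res.json({ results: resp.hits.hits.map(h => ({ id: h._id, score: h._score })) });
  }).catch(err => res.status(500).json({ error: err.message }));
});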

Thanks @LukasLohoff @jansule for collecting first ideas on this!

  • Dockerfile is used for similarity calculation > could be a standalone module (paper idea: "Tokenizing and similarity calculation for Dockerfiles in document search engines/Elasticsearch")
  • Data files are integrated
  • Code files are integrated
  • What are good weights to combine the different similarities?

Handle special characters in incoming query string to support searching for DOIs

We can expect people to submit URLs and URIs, and also DOIs, to the search endpoint:

https://o2r.uni-muenster.de/api/v1/search?q=10.5555%2F12345678

Currently this returns an error:

"failed_shards": [

    {
        "shard": 0,
        "index": "o2r",
        "node": "XMeUHWUmRyepQQbbDdRXeA",
        "reason": {
            "type": "query_shard_exception",
            "reason": "Failed to parse query [10.5555/12345678]",
            "index_uuid": "AWf5E-1_TrmdyKneJHYsnA",
            "index": "o2r",
            "caused_by": {
                "type": "parse_exception",
                "reason": "Cannot parse '10.5555/12345678': Lexical error at line 1, column 17.  Encountered: <EOF> after : \"/12345678\"",
                "caused_by": {
                    "type": "token_mgr_error",
                    "reason": "Lexical error at line 1, column 17.  Encountered: <EOF> after : \"/12345678\""
                }
            }
        }
    }

]

This should be handled by the finder, ideally by supporting searches for DOIs.
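
A minimal sketch of one way to do that: escape the Lucene reserved characters before the string reaches the query_string query (the function name is illustrative; switching to a simple_query_string query, which ignores invalid syntax instead of failing, would be an alternative):

// Escape Lucene query_string reserved characters so inputs like
// "10.5555/12345678" parse instead of raising a query_shard_exception.
// "<" and ">" cannot be escaped in query_string, so they are removed.
function escapeQuery(q) {
  return q
    .replace(/[<>]/g, '')
    .replace(/[+\-=&|!(){}\[\]^"~*?:\\\/]/g, '\\$&');
}

// e.g. escapeQuery('10.5555/12345678') === '10.5555\\/12345678'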

Implement simple full text search using MongoDB

Later we might leverage an advanced search engine (i.e. Elasticsearch), but for first prototypes MongoDB's full text search should suffice.

Resources:

Calling http://.../api/v1/search?q=word should return a list of compendia matching "word"

Filters:

  • ...?q=term&type={compendium,job} (would be interesting to index job log outputs!)
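
A minimal sketch of the MongoDB-based prototype, assuming the compendia live in a muncher database and the o2r title/description are the fields to cover (database, collection, and field names are assumptions):

const { MongoClient } = require('mongodb');

async function search(term) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const compendia = client.db('muncher').collection('compendia');

  // one-time setup: a text index over the fields to search
  await compendia.createIndex({
    'metadata.o2r.title': 'text',
    'metadata.o2r.description': 'text'
  });

  // ?q=word becomes a $text query, sorted by relevance
  return compendia
    .find({ $text: { $search: term } }, { projection: { score: { $meta: 'textScore' } } })
    .sort({ score: { $meta: 'textScore' } })
    .toArray();
}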

Completion suggester

Steps

Docs

Draft API

curl -XPOST '.../api/v1/suggest' -d'
{
    "suggest" : {
        "prefix" : "interp",
        "completion" : {
            "size": 5
        }
    }
}'

Returns the suggest part of the Elasticsearch completion suggester response (with internal fields removed, like _index, _id etc.):

{
  "suggest" : [ {
    "text" : "interp",
    "offset" : 0,
    "length" : 6,
    "options" : [ {
      "text" : "interpolation"
    }, {
      "text" : "interpolate"
    }, {
      "text" : "international"
    }, {
      "text" : "uninterpreted"
    } ]
  } ]
}

size is optional and defaults to 5, so the completion field can also be omitted completely.

Implementation note:

Internally, the query above is simply extended with the configured suggest field:

{
    "suggest" : {
        "prefix" : "interp",
        "completion" : {
            "field" : "suggest",
            "size": 5
        }
    }
}

Maybe use dot-notation to expand the JSON object?
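
A minimal sketch of that extension step using lodash's dot-notation set(), as hinted at above (the config key name is an assumption):

const set = require('lodash.set');

// Inject the configured suggest field (and the default size) into the
// incoming request body without mutating it.
function buildSuggestQuery(userBody, config) {
  const body = JSON.parse(JSON.stringify(userBody));
  set(body, 'suggest.completion.field', config.suggestField || 'suggest');
  if (!body.suggest.completion.size) {
    set(body, 'suggest.completion.size', 5); // documented default
  }
  return body;
}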

Numerous finder logs

After starting the platform, the finder produces a flood of log messages, as shown in the screenshot below. This makes it hard to see the logs from the other services, so I usually miss the debug logs of my own service and have to restart the entire platform.
However, the platform works fine.
(Screenshot from 2018-05-29 09:35:51 showing the repeated finder log output)

Use text instead of string field

When starting the reference implementation, I get the following warnings:

elasticsearch2_1  | [2018-05-17T09:18:44,516][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [_special]
elasticsearch2_1  | [2018-05-17T09:18:44,517][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [doi]
elasticsearch2_1  | [2018-05-17T09:18:44,517][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [doiurl]
elasticsearch2_1  | [2018-05-17T09:18:44,520][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [_special]
elasticsearch2_1  | [2018-05-17T09:18:44,521][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [doi]
elasticsearch2_1  | [2018-05-17T09:18:44,521][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [doiurl]
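
A minimal sketch of the corrected mapping for the affected fields (choosing keyword for the identifier fields is an assumption, not a decision):

// the string type is deprecated: analyzed fields become text,
// exact-match fields such as identifiers become keyword
const mapping = {
  properties: {
    _special: { type: 'text' },
    doi:      { type: 'keyword' },
    doiurl:   { type: 'keyword' }
  }
};

// applied e.g. via the JS client:
// client.indices.putMapping({ index: 'o2r', type: 'compendia', body: mapping });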

Put compendia and jobs into different indices instead of using types

Compendia and jobs are completely different. Here are the Elasticsearch mappings (via http://localhost:9200/_mapping):

(Screenshot of the compendia and jobs mappings)

"Do your documents have similar mappings? If no, use different indices." via https://www.elastic.co/blog/index-vs-type

So, they should go into the indices o2r-compendia and o2r-jobs, and the /search endpoint should have a parameter (docs?) to search only a specific one or both, see https://www.elastic.co/guide/en/elasticsearch/reference/5.6/search-search.html#search-multi-index-type

Queries to be supported:

../search?q=my-search-term&resource=all (default)
../search?q=my-search-term&resource=job
../search?q=my-search-term&resource=compendium
../search?q=my-search-term&resource=compendium,job (effectively the same as "all")
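
A minimal sketch of mapping the resource parameter onto the two indices; Elasticsearch accepts a comma-separated index list, so both can still be searched in one request (parameter handling is an assumption):

const indexFor = { compendium: 'o2r-compendia', job: 'o2r-jobs' };

// "all" (the default) and "compendium,job" both resolve to searching both indices
function resolveIndices(resource = 'all') {
  if (resource === 'all') return Object.values(indexFor).join(',');
  return resource.split(',')
    .map(r => indexFor[r.trim()])
    .filter(Boolean)
    .join(',');
}

// e.g. client.search({ index: resolveIndices(req.query.resource), q: req.query.q });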

Request entity too large

When the HTML documents become large, the bulk upload can run into trouble, see the log below.

  • upload many papers from the test corpus
  • find out if this is a server or a client issue
  • test with a reduced bulk size (see the sketch below the log)
ESMongoSync: Oplog tailing connection successful.
finder_1          | ESMongoSync: Processing watchers on priority level  2
finder_1          | ESMongoSync: Processing  jobs  collection
finder_1          | ESMongoSync: Batch creation complete. Processing...
finder_1          | ESMongoSync: Number of documents in batch -  20
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | { Error: Request Entity Too Large
finder_1          |     at respond (/finder/node_modules/elasticsearch/src/lib/transport.js:307:15)
finder_1          |     at checkRespForFailure (/finder/node_modules/elasticsearch/src/lib/transport.js:266:7)
finder_1          |     at HttpConnector.<anonymous> (/finder/node_modules/elasticsearch/src/lib/connectors/http.js:159:7)
finder_1          |     at IncomingMessage.bound (/finder/node_modules/elasticsearch/node_modules/lodash/dist/lodash.js:729:21)
finder_1          |     at emitNone (events.js:111:20)
finder_1          |     at IncomingMessage.emit (events.js:208:7)
finder_1          |     at endReadableNT (_stream_readable.js:1055:12)
finder_1          |     at _combinedTickCallback (internal/process/next_tick.js:138:11)
finder_1          |     at process._tickCallback (internal/process/next_tick.js:180:9)
finder_1          |   status: 413,
finder_1          |   displayName: 'RequestEntityTooLarge',
finder_1          |   message: 'Request Entity Too Large',
finder_1          |   path: '/_bulk',
finder_1          |   query: {},
finder_1          |   body: '{"index":{"_index":"compendia","_type":"compendia","_id":"5a9fe27b888b05001c5ffb66"}}\n{"createdAt":"2018-03-07T13:00:43.032Z","updatedAt":"2018-03-07T13:00:43.032Z","id":"5a9fe27b888b05001c5ffb66","user":"0000-0001-6225-344X","metadata":{"o2r":{"upload_type":"publication","title":"Capacity of container ships in seaborne trade from 1980 to 2016 (in million dwt)*","temporal":{"end":"2017-03-07T00:00:00","begin":"2017-03-07T00:00:00"},"spatial":{"union":{"bbox":[[181,181],[-181,181],[-181,-181],[181,-181]]},"files":[]},"publication_type":"other","publication_date":"2018-03-07","paperLanguage":[],"mainfile_candidates":["main.Rmd"],"mainfile":"main.Rmd","license":{"uibindings":null,"text":null,"md":null,"data":null,"code":null},"keywords":["container","ship","trade","statistic"],"interaction":[],"inputfiles":["data.csv"],"identifier":{"reserveddoi":null,"doiurl":"https://doi.org/10.5555/666655554444","doi":"10.5555/666655554444"},"ercIdentifier":"Q8AKA","displayfile_candidates":["display.html"],"displayfile":"display.html","description":"Capacity of container ships in seaborne trade of the world container ship fleet.\\n","depends":[],"creators":[{"orcid":"0000-0002-0024-5046","name":"Daniel Nüst","affiliation":"o2r team"}],"communities":[{"identifier":"o2r"}],"codefiles":["main.Rmd"],"access_right":"open"}},"substituted":false,"compendium":false,"bag":false,"candidate":true,"jobs":[],"created":"2018-03-07T13:00:43.032Z","compendium_id":"Q8AKA","files":{"path":"/api/v1/compendium/Q8AKA/data","name":"Q8AKA","children":[{"path":"/api/v1/compendium/Q8AKA/data/.erc","name":".erc","children":[{"path":"/api/v1/compendium/Q8AKA/data/.erc/metadata_o2r_1.json","name":"metadata_o2r_1.json","size":1916,"extension":".json","type":"file"},{"path":"/api/v1/compendium/Q8AKA/data/.erc/metadata_raw.json","name":"metadata_raw.json","size":2654,"extension":".json","type":"file"},{"path":"/api/v1/compendium/Q8AKA/data/.erc/package_slip.json","name":"package_slip.json","size":409,"extension":".json","type":"file"}],"size":4979,"type":"directory"},{"path":"/api/v1/compendium/Q8AKA/data/data.csv","name":"data.csv","size":122,"extension":".csv","type":"file"},{"path":"/api/v1/compendium/Q8AKA/data/display.html","name":"display.html","size":651313,"extension":".html","type":"file"},{"path":"/api/v1/compendium/Q8AKA/data/main.Rmd","name":"main.Rmd","size":1117,"extension":".rmd","type":"file"}],"size":657531,"type":"directory"},"texts":[{"path":"/.erc/metadata_o2r_1.json","name":"metadata_o2r_1.json","size":1916,"extension":".json","type":"application/json"},{"path":"/.erc/metadata_raw.json","name":"metadata_raw.json","size":2654,"extension":".json","type":"application/json"},{"path":"/.erc/package_slip.json","name":"package_slip.json","size":409,"extension":".json","type":"application/json"},{"path":"/data.csv","name":"data.csv","size":122,"extension":".csv","type":"text/csv","content":"\\"year\\",\\"capacity\\"\\n\\"1980\\",11\\n\\"1985\\",20\\n\\"1990\\",26\\n\\"1995\\",44\\n\\"2000\\",64\\n\\"2005\\",98\\n\\"2010\\",169\\n\\"2014\\",216\\n\\"2015\\",228\\n\\"2016\\",244\\n"},{"path":"/display.html","name":"display.html","size":651313,"extension":".html","type":"text/html","content":"<!DOCTYPE html>\\n\\n<html xmlns=\\"http://www.w3.org/1999/xhtml\\">\\n\\n<head>\\n\\n<meta charset=\\"utf-8\\" />\\n<meta http-equiv=\\"Content-Type\\" content=\\"text/html; charset=utf-8\\" />\\n<meta name=\\"generator\\" content=\\"pandoc\\" />\\n\\n\\n\\n<meta name=\\"date\\" content=\\"2017-01-01\\" 
/>\\n\\n<title>Capacity of container ships in s

[...]
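
For the reduced bulk size idea, a minimal sketch of capping the batch size on the client side (the chunk size is an arbitrary assumption; raising http.max_content_length on the Elasticsearch side would be the server-side alternative):

// Send the bulk body in smaller slices instead of one huge request.
// "actions" is the usual bulk array of alternating action and document objects.
async function bulkInChunks(client, actions, chunkSize = 50) {
  for (let i = 0; i < actions.length; i += chunkSize * 2) {
    await client.bulk({ body: actions.slice(i, i + chunkSize * 2) });
  }
}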

Fix real-time indexing

It seems newly added documents are processed by the finder but do not appear in Elasticsearch.

Random crashing of the finder

When starting the new platform with docker-compose up or when uploading ERCs, it sometimes happens (I could not find any regularity) that the finder crashes. After the crash, the search endpoint localhost/api/v1/search is no longer accessible; all other API endpoints still are. Markus sees the same error, and sudo sysctl -q -w vm.max_map_count=262144 has been set.

Add tests

Add some integration tests:

  • after creating a new compendium, check if all metadata arrived in the search database correctly.
  • ...

To get them running on Travis CI, the existing configurations (e.g. o2r-substituter) should help a lot.
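
A minimal sketch of the first test case (mocha/chai style as in the other o2r services; the URL, the fixture title, and the response check are assumptions):

const request = require('request');
const assert = require('chai').assert;

describe('finder search', () => {
  // assumes a before() hook uploaded a compendium whose title mentions container ships
  it('returns a newly created compendium in the search results', (done) => {
    request('http://localhost/api/v1/search?q=container%20ships', (err, res, body) => {
      assert.ifError(err);
      assert.equal(res.statusCode, 200);
      assert.include(body, 'Capacity of container ships'); // title from the fixture metadata
      done();
    });
  });
});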

Index text files of a compendium

The text of all files within a compendium should be indexed for full text search.

Evaluate potential approaches:

  • read text files and index them in a "full text" field? (a sketch of this approach follows below)
  • handle them as attachments?
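
A minimal sketch of the first approach, walking the compendium directory and collecting the content of plain text files (the extension whitelist and field names are assumptions):

const fs = require('fs');
const path = require('path');

const textExtensions = ['.txt', '.md', '.rmd', '.r', '.csv', '.json', '.html'];

// Recursively collect { path, content } entries for all plain text files,
// ready to be indexed in a full text field of the compendium document.
function collectTexts(dir, files = []) {
  for (const entry of fs.readdirSync(dir)) {
    const full = path.join(dir, entry);
    if (fs.statSync(full).isDirectory()) {
      collectTexts(full, files);
    } else if (textExtensions.includes(path.extname(entry).toLowerCase())) {
      files.push({ path: full, content: fs.readFileSync(full, 'utf8') });
    }
  }
  return files;
}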
