o2r-finder's Issues

Evaluate golang

Only index o2r metadata in Elasticsearch

Reasons:

  • raw metadata is not "checked"
  • Zenodo metadata may only contain what is already in the o2r metadata (nothing is missing)
  • full text should still be indexed

@7048730 Can you confirm this? Do you see any reason to index more than just the o2r metadata, i.e. should the raw metadata be "searchable"?
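
A minimal sketch of what this restriction could look like in a user-defined transform function applied to each compendium before it goes to Elasticsearch (the function and field names follow the compendium documents shown further down in this list and are assumptions, not the actual finder code):

// Hypothetical transform: keep the checked o2r metadata block and the
// extracted file texts, drop metadata.raw entirely.
function transformCompendium(doc) {
  return {
    id: doc.id,
    compendium_id: doc.compendium_id,
    metadata: { o2r: doc.metadata ? doc.metadata.o2r : undefined },
    texts: doc.texts // full text content stays searchable
  };
}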

Implement syncing to Elasticsearch

Provide a list of similar compendia

Given a compendium id, the finder can provide a list of similar compendia based on upcoming research.
This list could be integrated into a UI, for example:

  • ordering of compendia for a substitution
  • show similar ERCs in the detail view of one ERC
  • show "near" ERCs when looking at the spatial data of one ERC

The API could take an identifier and return a list of identifiers and scores, or just proxy through the Elasticsearch response (i.e. include the actual metadata).

GET /api/v1/explore/Xid4U or GET /api/v1/compendium/Xid4U/similar

{ "results": [
  { "id": "12345", "score": 8.342 },
  { "id": "abcdf", "score": 5.42 },
  { "id": "12ab1", "score": 1.42 },
  { "id": "qwert", "score": 1.00 },
  { "id": "asdfg", "score": 0.1742 }
] }

Alternatively the API endpoint could re-use /search, but only if the response structure is the same, i.e. GET /api/v1/search?similar=Xid4U.

A verbose response could provide access to the sub-scores:

GET /api/v1/explore/Xid4U?verbose
{ 
  "results": [
     {"id": "aiejf", "score": 0.8813, "spatial": 0.2412, "code": 0.991, "text": 0.5 }, 
     {"id": "izxye", "score": 0.6713, ...}, 
     {...}
 ]
}

Using a further parameter, only one or more specific sub-scores are taken into account (and only the ones actually used appear in the response):

/api/v1/explore/Xid4U?component=code
/api/v1/explore/Xid4U?component=text,data
/api/v1/explore/Xid4U?component=text,data,spatial,temporal
/api/v1/explore/Xid4U?component=all (default)

component implies verbose.
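
The non-verbose variant could be backed by Elasticsearch's more_like_this query. A minimal sketch, assuming the old elasticsearch JS client used elsewhere in this service; the route, index name, and compared fields are assumptions, not decisions (per-component sub-scores would need separate queries):

const express = require('express');
const elasticsearch = require('elasticsearch');

const app = express();
const client = new elasticsearch.Client({ host: 'elasticsearch:9200' });

// GET /api/v1/compendium/:id/similar (one of the two route options above)
app.get('/api/v1/compendium/:id/similar', (req, res) => {
  client.search({
    index: 'o2r',
    body: {
      query: {
        more_like_this: {
          // fields to compare are an assumption
          fields: ['metadata.o2r.title', 'metadata.o2r.description', 'texts.content'],
          like: [{ _index: 'o2r', _id: req.params.id }],
          min_term_freq: 1
        }
      }
    }
  }).then(resp => {
    // reduce the Elasticsearch response to the id/score list proposed above
    res.json({ results: resp.hits.hits.map(h => ({ id: h._id, score: h._score })) });
  }).catch(err => res.status(500).json({ error: err.message }));
});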

Thanks @LukasLohoff @jansule for collecting first ideas on this!

  • Dockerfile is used for similarity calculation > could be a standalone module (paper idea: "Tokenizing and similarity calculation for Dockerfiles in document search engines/Elasticsearch")
  • Data files are integrated
  • Code files are integrated
  • What are good weights to combine the different similarities?

Handle special characters in incoming query string to support searching for DOIs

We can expect people to submit URLs and URIs, and also DOIs, to the search endpoint:

https://o2r.uni-muenster.de/api/v1/search?q=10.5555%2F12345678

Currently this returns an error:

"failed_shards": [

    {
        "shard": 0,
        "index": "o2r",
        "node": "XMeUHWUmRyepQQbbDdRXeA",
        "reason": {
            "type": "query_shard_exception",
            "reason": "Failed to parse query [10.5555/12345678]",
            "index_uuid": "AWf5E-1_TrmdyKneJHYsnA",
            "index": "o2r",
            "caused_by": {
                "type": "parse_exception",
                "reason": "Cannot parse '10.5555/12345678': Lexical error at line 1, column 17.  Encountered: <EOF> after : \"/12345678\"",
                "caused_by": {
                    "type": "token_mgr_error",
                    "reason": "Lexical error at line 1, column 17.  Encountered: <EOF> after : \"/12345678\""
                }
            }
        }
    }

]

This should be handled by the finder, ideally by supporting searches for DOIs.
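
A minimal sketch of one way to do that: escape the Lucene reserved characters before the string reaches the query_string query (the function name is illustrative; switching to a simple_query_string query, which ignores invalid syntax instead of failing, would be an alternative):

// Escape Lucene query_string reserved characters so inputs like
// "10.5555/12345678" parse instead of raising a query_shard_exception.
// "<" and ">" cannot be escaped in query_string, so they are removed.
function escapeQuery(q) {
  return q
    .replace(/[<>]/g, '')
    .replace(/[+\-=&|!(){}\[\]^"~*?:\\\/]/g, '\\$&');
}

// e.g. escapeQuery('10.5555/12345678') === '10.5555\\/12345678'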

Implement simple full text search using MongoDB

Later we might leverage an advanced search engine (i.e. Elasticsearch), but for first prototypes MongoDB's full text search should suffice.

Resources:

Calling http://.../api/v1/search?q=word should return a list of compendia matching "word"

Filters:

  • ...?q=term&type={compendium,job} (would be interesting to index job log outputs!)
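
A minimal sketch of the MongoDB-based prototype, assuming the compendia live in a muncher database and the o2r title/description are the fields to cover (database, collection, and field names are assumptions):

const { MongoClient } = require('mongodb');

async function search(term) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const compendia = client.db('muncher').collection('compendia');

  // one-time setup: a text index over the fields to search
  await compendia.createIndex({
    'metadata.o2r.title': 'text',
    'metadata.o2r.description': 'text'
  });

  // ?q=word becomes a $text query, sorted by relevance
  return compendia
    .find({ $text: { $search: term } }, { projection: { score: { $meta: 'textScore' } } })
    .sort({ score: { $meta: 'textScore' } })
    .toArray();
}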

Completion suggester

Steps

Docs

Draft API

curl -XPOST '.../api/v1/suggest' -d'
{
    "suggest" : {
        "prefix" : "interp",
        "completion" : {
            "size": 5
        }
    }
}'

Returns the suggest part of the Elasticsearch completion suggester response (with internal fields removed, like _index, _id etc.):

{
  "suggest" : [ {
    "text" : "interp",
    "offset" : 0,
    "length" : 6,
    "options" : [ {
      "text" : "interpolation"
    }, {
      "text" : "interpolate"
    }, {
      "text" : "international"
    }, {
      "text" : "uninterpreted"
    } ]
  } ]
}

size is optional and defaults to 5, so the completion field can also be omitted completely.

Implementation note:

Internally, the query above is simply extended with the configured suggest field:

{
    "suggest" : {
        "prefix" : "interp",
        "completion" : {
            "field" : "suggest",
            "size": 5
        }
    }
}

Maybe use dot-notation to expand the JSON object?
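
A minimal sketch of that extension step using lodash's dot-notation set(), as hinted at above (the config key name is an assumption):

const set = require('lodash.set');

// Inject the configured suggest field (and the default size) into the
// incoming request body without mutating it.
function buildSuggestQuery(userBody, config) {
  const body = JSON.parse(JSON.stringify(userBody));
  set(body, 'suggest.completion.field', config.suggestField || 'suggest');
  if (!body.suggest.completion.size) {
    set(body, 'suggest.completion.size', 5); // documented default
  }
  return body;
}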

Numerous finder logs

After starting the platform, the finder produces a flood of log messages, as shown in the screenshot below. This makes it hard to see the logs from the other services, so I usually miss the debug logs of my own service and have to restart the entire platform.
However, the platform works fine.
(Screenshot from 2018-05-29 09:35:51 showing the repeated finder log output)

Use text instead of string field

When starting the reference implementation, I get the following warnings:

elasticsearch2_1  | [2018-05-17T09:18:44,516][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [_special]
elasticsearch2_1  | [2018-05-17T09:18:44,517][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [doi]
elasticsearch2_1  | [2018-05-17T09:18:44,517][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [doiurl]
elasticsearch2_1  | [2018-05-17T09:18:44,520][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [_special]
elasticsearch2_1  | [2018-05-17T09:18:44,521][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [doi]
elasticsearch2_1  | [2018-05-17T09:18:44,521][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [doiurl]
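
A minimal sketch of the corrected mapping for the affected fields (choosing keyword for the identifier fields is an assumption, not a decision):

// the string type is deprecated: analyzed fields become text,
// exact-match fields such as identifiers become keyword
const mapping = {
  properties: {
    _special: { type: 'text' },
    doi:      { type: 'keyword' },
    doiurl:   { type: 'keyword' }
  }
};

// applied e.g. via the JS client:
// client.indices.putMapping({ index: 'o2r', type: 'compendia', body: mapping });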

Put compendia and jobs into different indices instead of using types

Compendia and jobs are completely different. Here are the Elasticsearch mappings (via http://localhost:9200/_mapping):

(Screenshot of the compendia and jobs mappings)

"Do your documents have similar mappings? If no, use different indices." via https://www.elastic.co/blog/index-vs-type

So, they should go into the indices o2r-compendia and o2r-jobs, and the /search endpoint should have a parameter (docs?) to search only a specific one or both, see https://www.elastic.co/guide/en/elasticsearch/reference/5.6/search-search.html#search-multi-index-type

Queries to be supported:

../search?q=my-search-term&resource=all (default)
../search?q=my-search-term&resource=job
../search?q=my-search-term&resource=compendium
../search?q=my-search-term&resource=compendium,job (effectively the same as "all")
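
A minimal sketch of mapping the resource parameter onto the two indices; Elasticsearch accepts a comma-separated index list, so both can still be searched in one request (parameter handling is an assumption):

const indexFor = { compendium: 'o2r-compendia', job: 'o2r-jobs' };

// "all" (the default) and "compendium,job" both resolve to searching both indices
function resolveIndices(resource = 'all') {
  if (resource === 'all') return Object.values(indexFor).join(',');
  return resource.split(',')
    .map(r => indexFor[r.trim()])
    .filter(Boolean)
    .join(',');
}

// e.g. client.search({ index: resolveIndices(req.query.resource), q: req.query.q });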

Request entity too large

When the HTML documents become large, the bulk upload can run into trouble, see the log below.

  • upload many papers from the test corpus
  • find out if this is a server or a client issue
  • test with a reduced bulk size (see the sketch below the log)
ESMongoSync: Oplog tailing connection successful.
finder_1          | ESMongoSync: Processing watchers on priority level  2
finder_1          | ESMongoSync: Processing  jobs  collection
finder_1          | ESMongoSync: Batch creation complete. Processing...
finder_1          | ESMongoSync: Number of documents in batch -  20
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1          | { Error: Request Entity Too Large
finder_1          |     at respond (/finder/node_modules/elasticsearch/src/lib/transport.js:307:15)
finder_1          |     at checkRespForFailure (/finder/node_modules/elasticsearch/src/lib/transport.js:266:7)
finder_1          |     at HttpConnector.<anonymous> (/finder/node_modules/elasticsearch/src/lib/connectors/http.js:159:7)
finder_1          |     at IncomingMessage.bound (/finder/node_modules/elasticsearch/node_modules/lodash/dist/lodash.js:729:21)
finder_1          |     at emitNone (events.js:111:20)
finder_1          |     at IncomingMessage.emit (events.js:208:7)
finder_1          |     at endReadableNT (_stream_readable.js:1055:12)
finder_1          |     at _combinedTickCallback (internal/process/next_tick.js:138:11)
finder_1          |     at process._tickCallback (internal/process/next_tick.js:180:9)
finder_1          |   status: 413,
finder_1          |   displayName: 'RequestEntityTooLarge',
finder_1          |   message: 'Request Entity Too Large',
finder_1          |   path: '/_bulk',
finder_1          |   query: {},
finder_1          |   body: '{"index":{"_index":"compendia","_type":"compendia","_id":"5a9fe27b888b05001c5ffb66"}}\n{"createdAt":"2018-03-07T13:00:43.032Z","updatedAt":"2018-03-07T13:00:43.032Z","id":"5a9fe27b888b05001c5ffb66","user":"0000-0001-6225-344X","metadata":{"o2r":{"upload_type":"publication","title":"Capacity of container ships in seaborne trade from 1980 to 2016 (in million dwt)*","temporal":{"end":"2017-03-07T00:00:00","begin":"2017-03-07T00:00:00"},"spatial":{"union":{"bbox":[[181,181],[-181,181],[-181,-181],[181,-181]]},"files":[]},"publication_type":"other","publication_date":"2018-03-07","paperLanguage":[],"mainfile_candidates":["main.Rmd"],"mainfile":"main.Rmd","license":{"uibindings":null,"text":null,"md":null,"data":null,"code":null},"keywords":["container","ship","trade","statistic"],"interaction":[],"inputfiles":["data.csv"],"identifier":{"reserveddoi":null,"doiurl":"https://doi.org/10.5555/666655554444","doi":"10.5555/666655554444"},"ercIdentifier":"Q8AKA","displayfile_candidates":["display.html"],"displayfile":"display.html","description":"Capacity of container ships in seaborne trade of the world container ship fleet.\\n","depends":[],"creators":[{"orcid":"0000-0002-0024-5046","name":"Daniel Nüst","affiliation":"o2r team"}],"communities":[{"identifier":"o2r"}],"codefiles":["main.Rmd"],"access_right":"open"}},"substituted":false,"compendium":false,"bag":false,"candidate":true,"jobs":[],"created":"2018-03-07T13:00:43.032Z","compendium_id":"Q8AKA","files":{"path":"/api/v1/compendium/Q8AKA/data","name":"Q8AKA","children":[{"path":"/api/v1/compendium/Q8AKA/data/.erc","name":".erc","children":[{"path":"/api/v1/compendium/Q8AKA/data/.erc/metadata_o2r_1.json","name":"metadata_o2r_1.json","size":1916,"extension":".json","type":"file"},{"path":"/api/v1/compendium/Q8AKA/data/.erc/metadata_raw.json","name":"metadata_raw.json","size":2654,"extension":".json","type":"file"},{"path":"/api/v1/compendium/Q8AKA/data/.erc/package_slip.json","name":"package_slip.json","size":409,"extension":".json","type":"file"}],"size":4979,"type":"directory"},{"path":"/api/v1/compendium/Q8AKA/data/data.csv","name":"data.csv","size":122,"extension":".csv","type":"file"},{"path":"/api/v1/compendium/Q8AKA/data/display.html","name":"display.html","size":651313,"extension":".html","type":"file"},{"path":"/api/v1/compendium/Q8AKA/data/main.Rmd","name":"main.Rmd","size":1117,"extension":".rmd","type":"file"}],"size":657531,"type":"directory"},"texts":[{"path":"/.erc/metadata_o2r_1.json","name":"metadata_o2r_1.json","size":1916,"extension":".json","type":"application/json"},{"path":"/.erc/metadata_raw.json","name":"metadata_raw.json","size":2654,"extension":".json","type":"application/json"},{"path":"/.erc/package_slip.json","name":"package_slip.json","size":409,"extension":".json","type":"application/json"},{"path":"/data.csv","name":"data.csv","size":122,"extension":".csv","type":"text/csv","content":"\\"year\\",\\"capacity\\"\\n\\"1980\\",11\\n\\"1985\\",20\\n\\"1990\\",26\\n\\"1995\\",44\\n\\"2000\\",64\\n\\"2005\\",98\\n\\"2010\\",169\\n\\"2014\\",216\\n\\"2015\\",228\\n\\"2016\\",244\\n"},{"path":"/display.html","name":"display.html","size":651313,"extension":".html","type":"text/html","content":"<!DOCTYPE html>\\n\\n<html xmlns=\\"http://www.w3.org/1999/xhtml\\">\\n\\n<head>\\n\\n<meta charset=\\"utf-8\\" />\\n<meta http-equiv=\\"Content-Type\\" content=\\"text/html; charset=utf-8\\" />\\n<meta name=\\"generator\\" content=\\"pandoc\\" />\\n\\n\\n\\n<meta name=\\"date\\" content=\\"2017-01-01\\" 
/>\\n\\n<title>Capacity of container ships in s

[...]
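
For the reduced bulk size idea, a minimal sketch of capping the batch size on the client side (the chunk size is an arbitrary assumption; raising http.max_content_length on the Elasticsearch side would be the server-side alternative):

// Send the bulk body in smaller slices instead of one huge request.
// "actions" is the usual bulk array of alternating action and document objects.
async function bulkInChunks(client, actions, chunkSize = 50) {
  for (let i = 0; i < actions.length; i += chunkSize * 2) {
    await client.bulk({ body: actions.slice(i, i + chunkSize * 2) });
  }
}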

Fix real-time indexing

It seems newly added documents are processed by the finder but do not appear in Elasticsearch.

Random crashing of the finder

When starting the new platform with docker-compose up or when uploading ERCs, it sometimes happens (I could not find any regularity) that the finder crashes. After the crash, the search endpoint localhost/api/v1/search is no longer accessible; all other API endpoints still are. Markus sees the same error, and sudo sysctl -q -w vm.max_map_count=262144 has been set.

Add tests

Add some integration tests:

  • after creating a new compendium, check if all metadata arrived in the search database correctly.
  • ...

To get them running on Travis CI, the existing configurations (e.g. o2r-substituter) should help a lot.
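
A minimal sketch of the first test case (mocha/chai style as in the other o2r services; the URL, the fixture title, and the response check are assumptions):

const request = require('request');
const assert = require('chai').assert;

describe('finder search', () => {
  // assumes a before() hook uploaded a compendium whose title mentions container ships
  it('returns a newly created compendium in the search results', (done) => {
    request('http://localhost/api/v1/search?q=container%20ships', (err, res, body) => {
      assert.ifError(err);
      assert.equal(res.statusCode, 200);
      assert.include(body, 'Capacity of container ships'); // title from the fixture metadata
      done();
    });
  });
});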

Index text files of a compendium

The text of all files within a compendium should be indexed for full text search.

Evaluate potential approaches:

  • read text files and index them in a "full text" field? (a sketch of this approach follows below)
  • handle them as attachments?
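
A minimal sketch of the first approach, walking the compendium directory and collecting the content of plain text files (the extension whitelist and field names are assumptions):

const fs = require('fs');
const path = require('path');

const textExtensions = ['.txt', '.md', '.rmd', '.r', '.csv', '.json', '.html'];

// Recursively collect { path, content } entries for all plain text files,
// ready to be indexed in a full text field of the compendium document.
function collectTexts(dir, files = []) {
  for (const entry of fs.readdirSync(dir)) {
    const full = path.join(dir, entry);
    if (fs.statSync(full).isDirectory()) {
      collectTexts(full, files);
    } else if (textExtensions.includes(path.extname(entry).toLowerCase())) {
      files.push({ path: full, content: fs.readFileSync(full, 'utf8') });
    }
  }
  return files;
}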
