o2r-project / o2r-finder
Node.js implementation of search features for the o2r API
License: Apache License 2.0
Important: Using golang is not about performance, but about trying out something new 😄
On analyzers: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
@LukasLohoff does this make sense?
mongo-connector seems to be the best tool, but it runs in Python and would require its own container; keep in mind for later.
While writing the tests for the finder, I noticed that documents in the index are only deleted when the finder is started. When doing, for example, a db.drop() call, the documents deleted from the database remain in the Elasticsearch index.
Given a compendium id, the finder can provide a list of similar compendia based on upcoming research.
This list could be integrated into a UI, for example:
The API could take an identifier and return a list of identifiers and scores, or just proxy through the Elasticsearch response (i.e. include the actual metadata).
GET /api/v1/explore/Xid4U or GET /api/v1/compendium/Xid4U/similar
{
  "results": [
    { "id": "12345", "score": 8.342 },
    { "id": "abcdf", "score": 5.42 },
    { "id": "12ab1", "score": 1.42 },
    { "id": "qwert", "score": 1.00 },
    { "id": "asdfg", "score": 0.1742 }
  ]
}
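A hedged sketch of how such a similarity list could be backed by Elasticsearch's more_like_this query. The index name "o2r" and the compared metadata fields are assumptions for illustration, not the finder's actual configuration; this only builds the request body, it does not send it.

```javascript
// Sketch: build an Elasticsearch "more_like_this" request body for a
// /similar endpoint. Index name and field names are assumptions.
function buildSimilarQuery(compendiumId, size) {
  return {
    index: 'o2r',
    body: {
      size: size || 5,
      query: {
        more_like_this: {
          fields: [
            'metadata.o2r.title',
            'metadata.o2r.description',
            'metadata.o2r.keywords'
          ],
          // "like" can reference an already-indexed document by id
          like: [{ _index: 'o2r', _id: compendiumId }],
          min_term_freq: 1
        }
      }
    }
  };
}
```

The response hits with their _score values could then be mapped onto the id/score list shown above.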
Alternatively the API endpoint could re-use /search, but only if the response structure is the same, i.e. GET /api/v1/search?similar=Xid4U.
A verbose response could provide access to the sub-scores:
GET /api/v1/explore/Xid4U?verbose
{
"results": [
{"id": "aiejf", "score": 0.8813, "spatial": 0.2412, "code": 0.991, "text": 0.5 },
{"id": "izxye", "score": 0.6713, ...},
{...}
]
}
Using a further parameter, only one or more specific sub-scores can be taken into account (and only the used ones are in the response):
/api/v1/explore/Xid4U?component=code
/api/v1/explore/Xid4U?component=text,data
/api/v1/explore/Xid4U?component=text,data,spatial,temporal
/api/v1/explore/Xid4U?component=all (default)
component implies verbose.
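A minimal sketch of parsing such a component parameter, assuming the sub-score names shown above; the function name and the exact defaulting rules are hypothetical:

```javascript
// Sketch: parse the proposed "component" query parameter into the list of
// sub-scores to use. An absent parameter behaves like "all"; explicitly
// passing the parameter implies a verbose response (an assumption based on
// the note above).
const ALL_COMPONENTS = ['code', 'text', 'data', 'spatial', 'temporal'];

function parseComponents(param) {
  if (param === undefined) {
    return { components: ALL_COMPONENTS, verbose: false };
  }
  if (param === 'all') {
    return { components: ALL_COMPONENTS, verbose: true };
  }
  const components = param.split(',')
    .map(c => c.trim())
    .filter(c => ALL_COMPONENTS.includes(c));
  return { components, verbose: true };
}
```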
Thanks @LukasLohoff @jansule for collecting first ideas on this!
We can expect people to submit URLs and URIs, as well as DOIs, to the search endpoint:
https://o2r.uni-muenster.de/api/v1/search?q=10.5555%2F12345678
Currently this returns an error:
"failed_shards": [
{
"shard": 0,
"index": "o2r",
"node": "XMeUHWUmRyepQQbbDdRXeA",
"reason": {
"type": "query_shard_exception",
"reason": "Failed to parse query [10.5555/12345678]",
"index_uuid": "AWf5E-1_TrmdyKneJHYsnA",
"index": "o2r",
"caused_by": {
"type": "parse_exception",
"reason": "Cannot parse '10.5555/12345678': Lexical error at line 1, column 17. Encountered: <EOF> after : \"/12345678\"",
"caused_by": {
"type": "token_mgr_error",
"reason": "Lexical error at line 1, column 17. Encountered: <EOF> after : \"/12345678\""
}
}
}
}
]
This should be handled by the finder, ideally by supporting search for DOIs.
The complete Elasticsearch error messages should never be exposed, e.g.
https://o2r.uni-muenster.de/api/v1/search?q=not/supported
may contain the error reason in the response, but not the full structure. There is no need to hide that we use Elasticsearch, but we should not expose internal information such as node and shard details.
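One way to support such queries is to escape the characters that are reserved in the Elasticsearch query_string syntax (notably "/", which breaks DOI queries) before passing the query on. This sketch follows the Lucene reserved-character list; the function name is hypothetical:

```javascript
// Sketch: escape query_string reserved characters so a DOI like
// 10.5555/12345678 no longer triggers a parse exception. The character set
// follows the Lucene query syntax; the two-character operators && and ||
// are covered by escaping each character individually.
function escapeQuery(q) {
  return q.replace(/[+\-=&|><!(){}\[\]^"~*?:\\\/]/g, '\\$&');
}
```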
Later we might leverage an advanced search engine (i.e. Elasticsearch), but for first prototypes MongoDB's full text search should suffice.
Resources:
Calling http://.../api/v1/search?q=word should return a list of compendia matching "word".
Filters:
...?q=term&type={compendium,job}
(would be interesting to index job log outputs!)
Evaluate approaches.
Querying the suggest field:
curl -XPOST '.../api/v1/suggest' -d '
{
  "suggest" : {
    "prefix" : "interp",
    "completion" : {
      "size": 5
    }
  }
}'
Returns the suggest part of the Elasticsearch completion suggester response (with internal fields removed, like _index, _id etc.):
{
  "suggest" : [ {
    "text" : "interp",
    "offset" : 0,
    "length" : 6,
    "options" : [ {
      "text" : "interpolation"
    }, {
      "text" : "interpolate"
    }, {
      "text" : "international"
    }, {
      "text" : "uninterpreted"
    } ]
  } ]
}
size is optional and defaults to 5, so the completion field can also be missing completely.
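Stripping the internal fields could be sketched like this, assuming the raw suggest entries are passed in as an array; the function name is hypothetical:

```javascript
// Sketch: reduce raw Elasticsearch completion suggester entries to the
// documented shape, dropping internal fields like _index, _id and _score
// from each option.
function cleanSuggestions(esSuggest) {
  return esSuggest.map(entry => ({
    text: entry.text,
    offset: entry.offset,
    length: entry.length,
    options: entry.options.map(o => ({ text: o.text }))
  }));
}
```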
Internally, the query above is simply extended with the configured suggest field to:
{
  "suggest" : {
    "prefix" : "interp",
    "completion" : {
      "field" : "suggest",
      "size": 5
    }
  }
}
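This extension could be sketched as follows, handling the size default and a completely missing completion object in one place; the function name is hypothetical:

```javascript
// Sketch: inject the configured suggest field into an incoming completion
// query. "size" defaults to 5, and the "completion" object may be absent
// entirely (Object.assign ignores undefined sources).
function extendSuggestQuery(query, field) {
  const completion = Object.assign({ size: 5 }, query.suggest.completion, { field });
  return { suggest: Object.assign({}, query.suggest, { completion }) };
}
```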
Maybe use dot-notation to expand the JSON object?
When starting the reference-implementation, I get the following warnings:
elasticsearch2_1 | [2018-05-17T09:18:44,516][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [_special]
elasticsearch2_1 | [2018-05-17T09:18:44,517][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [doi]
elasticsearch2_1 | [2018-05-17T09:18:44,517][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [doiurl]
elasticsearch2_1 | [2018-05-17T09:18:44,520][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [_special]
elasticsearch2_1 | [2018-05-17T09:18:44,521][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [doi]
elasticsearch2_1 | [2018-05-17T09:18:44,521][WARN ][o.e.d.i.m.StringFieldMapper$TypeParser] The [string] field is deprecated, please use [text] or [keyword] instead on [doiurl]
Compendia and jobs are completely different. Here are the Elasticsearch mappings (via http://localhost:9200/_mapping):
"Do your documents have similar mappings? If no, use different indices." via https://www.elastic.co/blog/index-vs-type
So, they should go into indices o2r-compendia and o2r-jobs, and the /search endpoint should have a property docs (?) to search only one specific index or both, see https://www.elastic.co/guide/en/elasticsearch/reference/5.6/search-search.html#search-multi-index-type
Queries to be supported:
../search?q=my-search-term&resource=all (default)
../search?q=my-search-term&resource=job
../search?q=my-search-term&resource=compendium
../search?q=my-search-term&resource=compendium,job (effectively the same as "all")
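The mapping from the resource parameter to the indices to query could be sketched as follows, using the index names proposed above; the function name is hypothetical:

```javascript
// Sketch: map the "resource" query parameter to the Elasticsearch indices
// to search. Absence of the parameter, or "all", means both indices;
// unknown values are silently dropped.
const INDICES = { compendium: 'o2r-compendia', job: 'o2r-jobs' };

function resolveIndices(resource) {
  if (!resource || resource === 'all') return Object.values(INDICES);
  return [...new Set(
    resource.split(',')
      .map(r => INDICES[r.trim()])
      .filter(Boolean)
  )];
}
```

The resulting array can be passed as the index parameter of a search request, so "compendium,job" is effectively the same as "all".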
When the HTML documents become large, the bulk upload can run into trouble, see the log below.
ESMongoSync: Oplog tailing connection successful.
finder_1 | ESMongoSync: Processing watchers on priority level 2
finder_1 | ESMongoSync: Processing jobs collection
finder_1 | ESMongoSync: Batch creation complete. Processing...
finder_1 | ESMongoSync: Number of documents in batch - 20
finder_1 | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1 | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1 | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1 | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1 | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1 | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1 | ESMongoSync: Warning. "id" field already exists, creating a suitable identifier "id" (default: using "_id") to enable document updates must be handled in user-defined transform function.
finder_1 | { Error: Request Entity Too Large
finder_1 | at respond (/finder/node_modules/elasticsearch/src/lib/transport.js:307:15)
finder_1 | at checkRespForFailure (/finder/node_modules/elasticsearch/src/lib/transport.js:266:7)
finder_1 | at HttpConnector.<anonymous> (/finder/node_modules/elasticsearch/src/lib/connectors/http.js:159:7)
finder_1 | at IncomingMessage.bound (/finder/node_modules/elasticsearch/node_modules/lodash/dist/lodash.js:729:21)
finder_1 | at emitNone (events.js:111:20)
finder_1 | at IncomingMessage.emit (events.js:208:7)
finder_1 | at endReadableNT (_stream_readable.js:1055:12)
finder_1 | at _combinedTickCallback (internal/process/next_tick.js:138:11)
finder_1 | at process._tickCallback (internal/process/next_tick.js:180:9)
finder_1 | status: 413,
finder_1 | displayName: 'RequestEntityTooLarge',
finder_1 | message: 'Request Entity Too Large',
finder_1 | path: '/_bulk',
finder_1 | query: {},
finder_1 | body: '{"index":{"_index":"compendia","_type":"compendia","_id":"5a9fe27b888b05001c5ffb66"}}\n{"createdAt":"2018-03-07T13:00:43.032Z","updatedAt":"2018-03-07T13:00:43.032Z","id":"5a9fe27b888b05001c5ffb66","user":"0000-0001-6225-344X","metadata":{"o2r":{"upload_type":"publication","title":"Capacity of container ships in seaborne trade from 1980 to 2016 (in million dwt)*","temporal":{"end":"2017-03-07T00:00:00","begin":"2017-03-07T00:00:00"},"spatial":{"union":{"bbox":[[181,181],[-181,181],[-181,-181],[181,-181]]},"files":[]},"publication_type":"other","publication_date":"2018-03-07","paperLanguage":[],"mainfile_candidates":["main.Rmd"],"mainfile":"main.Rmd","license":{"uibindings":null,"text":null,"md":null,"data":null,"code":null},"keywords":["container","ship","trade","statistic"],"interaction":[],"inputfiles":["data.csv"],"identifier":{"reserveddoi":null,"doiurl":"https://doi.org/10.5555/666655554444","doi":"10.5555/666655554444"},"ercIdentifier":"Q8AKA","displayfile_candidates":["display.html"],"displayfile":"display.html","description":"Capacity of container ships in seaborne trade of the world container ship fleet.\\n","depends":[],"creators":[{"orcid":"0000-0002-0024-5046","name":"Daniel Nüst","affiliation":"o2r 
team"}],"communities":[{"identifier":"o2r"}],"codefiles":["main.Rmd"],"access_right":"open"}},"substituted":false,"compendium":false,"bag":false,"candidate":true,"jobs":[],"created":"2018-03-07T13:00:43.032Z","compendium_id":"Q8AKA","files":{"path":"/api/v1/compendium/Q8AKA/data","name":"Q8AKA","children":[{"path":"/api/v1/compendium/Q8AKA/data/.erc","name":".erc","children":[{"path":"/api/v1/compendium/Q8AKA/data/.erc/metadata_o2r_1.json","name":"metadata_o2r_1.json","size":1916,"extension":".json","type":"file"},{"path":"/api/v1/compendium/Q8AKA/data/.erc/metadata_raw.json","name":"metadata_raw.json","size":2654,"extension":".json","type":"file"},{"path":"/api/v1/compendium/Q8AKA/data/.erc/package_slip.json","name":"package_slip.json","size":409,"extension":".json","type":"file"}],"size":4979,"type":"directory"},{"path":"/api/v1/compendium/Q8AKA/data/data.csv","name":"data.csv","size":122,"extension":".csv","type":"file"},{"path":"/api/v1/compendium/Q8AKA/data/display.html","name":"display.html","size":651313,"extension":".html","type":"file"},{"path":"/api/v1/compendium/Q8AKA/data/main.Rmd","name":"main.Rmd","size":1117,"extension":".rmd","type":"file"}],"size":657531,"type":"directory"},"texts":[{"path":"/.erc/metadata_o2r_1.json","name":"metadata_o2r_1.json","size":1916,"extension":".json","type":"application/json"},{"path":"/.erc/metadata_raw.json","name":"metadata_raw.json","size":2654,"extension":".json","type":"application/json"},{"path":"/.erc/package_slip.json","name":"package_slip.json","size":409,"extension":".json","type":"application/json"},{"path":"/data.csv","name":"data.csv","size":122,"extension":".csv","type":"text/csv","content":"\\"year\\",\\"capacity\\"\\n\\"1980\\",11\\n\\"1985\\",20\\n\\"1990\\",26\\n\\"1995\\",44\\n\\"2000\\",64\\n\\"2005\\",98\\n\\"2010\\",169\\n\\"2014\\",216\\n\\"2015\\",228\\n\\"2016\\",244\\n"},{"path":"/display.html","name":"display.html","size":651313,"extension":".html","type":"text/html","content":"<!DOCTYPE 
html>\\n\\n<html xmlns=\\"http://www.w3.org/1999/xhtml\\">\\n\\n<head>\\n\\n<meta charset=\\"utf-8\\" />\\n<meta http-equiv=\\"Content-Type\\" content=\\"text/html; charset=utf-8\\" />\\n<meta name=\\"generator\\" content=\\"pandoc\\" />\\n\\n\\n\\n<meta name=\\"date\\" content=\\"2017-01-01\\" />\\n\\n<title>Capacity of container ships in s
[...]
It seems newly added documents are processed by finder but they do not appear in Elasticsearch.
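One hedged way to avoid the 413 would be to split bulk requests into batches that stay below the server's http.max_content_length (100 MB by default in Elasticsearch). This sketch chunks pre-serialized bulk lines by byte size; real code would additionally have to keep each action line and its source line in the same batch:

```javascript
// Sketch: split serialized bulk request lines into batches whose total
// byte size stays under a limit, to avoid "Request Entity Too Large".
// The limit is an assumed, configurable value.
function chunkBulk(lines, maxBytes) {
  const batches = [];
  let batch = [];
  let bytes = 0;
  for (const line of lines) {
    const size = Buffer.byteLength(line, 'utf8') + 1; // + trailing newline
    if (batch.length > 0 && bytes + size > maxBytes) {
      batches.push(batch);
      batch = [];
      bytes = 0;
    }
    batch.push(line);
    bytes += size;
  }
  if (batch.length > 0) batches.push(batch);
  return batches;
}
```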
When starting the new platform with docker-compose up, or when uploading ERCs, it sometimes happens (I could not find any regularity) that the finder crashes. After the crash, the search API endpoint localhost/api/v1/search is no longer accessible, while all other API endpoints still are. Markus sees the same error, and sudo sysctl -q -w vm.max_map_count=262144 has been set.
Add some integration tests:
To get them running on Travis CI, the existing configurations (e.g. o2r-substituter) should help a lot.
The text of all files within a compendium should be indexed for full text search.
Evaluate potential approaches: