digirati-co-uk / dlcs-search-service
Search service for IIIF Content Search and annotation indexing.
License: MIT License
A very basic UI that can interface with the different endpoints for quick operations against a configured Elucidate.
Potential features:
https://iiif.io/api/search/1.0/#target-resource-structure
The annotations may also include references to the structure or structures that the target (the resource in the on property) is found within. The URI and type of the including resource must be given, and a label should be included.
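As a rough illustration of the quoted requirement, the sketch below attaches a within reference to an annotation's target. The helper name with_within, the example.org URIs, and the exact key names ("@id", "@type", "label") are assumptions for illustration and should be checked against the examples in the specification.

```python
def with_within(annotation, container_uri, container_type, label):
    """Return a copy of an annotation whose target records its including resource."""
    target = annotation["on"]
    # Assumption: a plain string target is expanded into an object so that
    # "within" can be attached to it.
    if isinstance(target, str):
        target = {"@id": target}
    target = dict(target)
    target["within"] = {
        "@id": container_uri,      # URI of the including resource (must be given)
        "@type": container_type,   # type of the including resource (must be given)
        "label": label,            # label (should be included)
    }
    return {**annotation, "on": target}


# Hypothetical annotation and manifest, for illustration only.
anno = {
    "@id": "https://example.org/anno/1",
    "@type": "oa:Annotation",
    "on": "https://example.org/canvas/1#xywh=100,100,250,20",
}
result = with_within(
    anno, "https://example.org/manifest/1", "sc:Manifest", "Example Manifest"
)
```

The original annotation is left unmodified, so the same indexed record could be emitted with or without the within information depending on the request.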
For this service:
within property.
Potentially, the list of IIIF resources that meet a particular search query could be output as:
https://preview.iiif.io/api/discovery-context/api/discovery/0.2/#iiif-change-discovery-api-0-2
https://iiif.io/api/presentation/2.1/#seealso
The seeAlso provides a mechanism for linking to machine readable content, with a profile. Advanced search use cases can be delivered by indexing machine-readable metadata with semantics, via the seeAlso property of IIIF resources.
This is not an MVP feature.
Potentially, this might include such formats as:
Basic RESTful API for CRUD operations on the annotation index.
https://iiif.io/api/search/1.0/#search-term-highlighting
The client ... needs to know the text that caused the service to create the hit, and enough information about where it occurs in the content to reliably highlight it and not highlight non-matches. To do this, the service can supply text before and after the matching term within the content of the annotation, via an Open Annotation TextQuoteSelector object. TextQuoteSelectors have three properties: exact to record the exact text to look for, prefix with some text before the match, and suffix with some text after the match.
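A minimal sketch of how the three TextQuoteSelector properties could be derived from a known match position. The function name, the context parameter, and the sample text are assumptions; how the service finds the match offsets in the first place is out of scope here.

```python
def text_quote_selector(text, start, end, context=20):
    """Build a TextQuoteSelector for text[start:end] with surrounding context."""
    return {
        "@type": "oa:TextQuoteSelector",
        "exact": text[start:end],                          # the matching text
        "prefix": text[max(0, start - context):start],     # text before the match
        "suffix": text[end:end + context],                 # text after the match
    }


text = "A bird in the hand is worth two in the bush"
start = text.index("hand")
selector = text_quote_selector(text, start, start + len("hand"), context=7)
```

Clamping the prefix at zero avoids negative slice indices, which in Python would silently wrap around to the end of the string.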
When querying the search service (however that works), it would be nice to be able to request a particular grouping of the results. This could be driven by different endpoints. For example:
Ideas for responses in this case:
{
  "items": [
    {
      "id": "https://omeka.org/api/user/1",
      "name": "Digirati",
      "annotations": {
        "id": "http://search-service.org/dereferencable-anno-list",
        "type": "AnnotationPage",
        "total": 123
      }
    }
  ]
}
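One way the grouped response above might be assembled, sketched under the assumption that each indexed annotation carries a creator object with id and name. The function name group_by_creator, the list_base URL, and the creator query parameter are hypothetical.

```python
from collections import Counter
from urllib.parse import quote


def group_by_creator(annotations, list_base):
    """Group annotations by creator id and summarise each group."""
    counts = Counter(a["creator"]["id"] for a in annotations)
    names = {a["creator"]["id"]: a["creator"].get("name") for a in annotations}
    return {
        "items": [
            {
                "id": creator_id,
                "name": names[creator_id],
                "annotations": {
                    # Hypothetical URL of a dereferenceable list for this group.
                    "id": f"{list_base}?creator={quote(creator_id, safe='')}",
                    "type": "AnnotationPage",
                    "total": total,
                },
            }
            for creator_id, total in counts.items()
        ]
    }


# Illustrative input only.
annos = [
    {"creator": {"id": "https://omeka.org/api/user/1", "name": "Digirati"}},
    {"creator": {"id": "https://omeka.org/api/user/1", "name": "Digirati"}},
    {"creator": {"id": "https://omeka.org/api/user/2", "name": "Other"}},
]
grouped = group_by_creator(annos, "http://search-service.org/annotations")
```

In a real deployment the counting would more likely be done by an Elasticsearch aggregation than in application code; the sketch only shows the shape of the output.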
Visibility of indexed objects to particular groups/roles (might be tags)
Is the existence of a result knowable, even if you can't see its body?
Consider in light of Ghent requirements
Search use cases beyond IIIF Content Search "search within", for example:
"I want to see all the pre-20th century archival records that contain 'Navajo'"
or
"Which documents from this archival series have been tagged with 'Paris'?"
or
"Find 'John Smith' in records from 1929"
require indexing of IIIF Presentation API content, not just annotations.
Potentially index IIIF Presentation API content:
Potentially output a IIIF Collection of resources that match a particular query that can be loaded into a IIIF viewer.
Or potentially, a manifest that comprises just canvases that meet a particular set of search criteria.
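A sketch of the first option, wrapping matched manifests in a IIIF Presentation 2.1 Collection that a viewer could load. The function name and the shape of the input records (dictionaries with id and label) are assumptions about what the index would return.

```python
def collection_from_hits(collection_id, label, manifests):
    """Wrap matching manifests in a Presentation 2.1 Collection."""
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": collection_id,
        "@type": "sc:Collection",
        "label": label,
        "manifests": [
            {"@id": m["id"], "@type": "sc:Manifest", "label": m["label"]}
            for m in manifests
        ],
    }


# Hypothetical search results, for illustration only.
hits = [
    {"id": "https://example.org/manifest/1", "label": "Record 1"},
    {"id": "https://example.org/manifest/2", "label": "Record 2"},
]
collection = collection_from_hits(
    "https://example.org/search/collection?q=Navajo", "Results for 'Navajo'", hits
)
```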
The system should be extremely modular, and a "core" minimal deployment should be possible with just support for IIIF Content Search features alone.
When spun up with a configured Elucidate (or S3), the service can bring itself to a working state from scratch using bulk ingest. This should also allow it to scale horizontally.
Needs refinement.
How should we think about temporal ranges? That is, should we be able to query for annotations on a particular time segment of time-based media? Or is this client side?
IIIF Presentation API objects and annotations typically have a context, for example:
Some search use cases, such as searching across some context of discovery (platform, exhibition, etc.), may require indexing associated content such as HTML.
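If this were handled server side, one possible approach is to index the temporal part of each target and test it for overlap with the requested segment. The sketch below assumes targets use a media fragment of the form #t=start,end in plain seconds; the function names are hypothetical and other time formats are not handled.

```python
from urllib.parse import urlsplit


def temporal_range(target_uri):
    """Return (start, end) in seconds from a #t=start,end fragment, or None."""
    fragment = urlsplit(target_uri).fragment
    for part in fragment.split("&"):
        if part.startswith("t="):
            start, _, end = part[2:].partition(",")
            # A missing start means 0; a missing end means "until the end".
            return (float(start or 0), float(end) if end else float("inf"))
    return None


def overlaps(a, b):
    """True if two half-open (start, end) ranges share any time."""
    return a[0] < b[1] and b[0] < a[1]


anno_range = temporal_range("https://example.org/canvas/1#t=10,20")
matches = overlaps(anno_range, (15.0, 30.0))
```

Storing start and end as numeric fields would let the index answer the same overlap question with two range filters.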
Evaluate for quality, reusability, inspiration:
Consider relationship between coordinate service and search index.
DLCS approach: separate coordinate and search services
Others: combined coordinate and text index.
The service must support non-Western languages and scripts, including, but not limited to:
https://iiif.io/api/search/1.0/#autocomplete
The autocomplete service returns terms that can be added into the q parameter of the related search service, given the first characters of the term.
The service should support all of the parameters for the search query, see: #6
plus the additional min parameter: https://iiif.io/api/search/1.0/#query-parameters-1
min parameter.
See: #3
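The autocomplete behaviour, including the min parameter, can be sketched as a filter over term counts. The function name, its arguments, and the in-memory counts dictionary are assumptions; a real service would obtain terms from the index (for example via the completion suggester).

```python
from urllib.parse import urlencode


def term_list(counts, prefix, autocomplete_url, search_url, min_count=1):
    """Return terms starting with prefix that occur at least min_count times."""
    prefix = prefix.lower()
    terms = [
        {
            "match": term,
            "count": count,
            # Link that runs the related search service for this term.
            "url": f"{search_url}?{urlencode({'q': term})}",
        }
        for term, count in sorted(counts.items())
        if term.startswith(prefix) and count >= min_count
    ]
    return {
        "@context": "http://iiif.io/api/search/1/context.json",
        "@id": f"{autocomplete_url}?{urlencode({'q': prefix, 'min': min_count})}",
        "@type": "search:TermList",
        "terms": terms,
    }


# Hypothetical term counts, for illustration only.
counts = {"bird": 3, "biro": 1, "hand": 5}
response = term_list(
    counts,
    "bi",
    "https://example.org/autocomplete",
    "https://example.org/search",
    min_count=2,
)
```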
DLCS use cases assume the use of Iris as a message bus with a dependency on AWS SNS and AWS SQS.
For local deployment, testing, and potential containerised deployments, the service should support:
Is it worth considering indexing the Annotation Studio draft annotations?
We have existing code, produced for NLW, that can parse these and convert them into something indexable.
Benefits:
https://iiif.io/api/search/1.0/#search-term-snippets
The simplest addition to the hit object is to add text that appears before and after the matching text in the annotation.
The service may add a before property to the hit with some amount of text that appears before the content of the annotation (given in chars), and may also add an after property with some amount of text that appears after the content of the annotation.
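A minimal sketch of a hit carrying before and after snippets, assuming the service knows the offsets of the annotation's content within a larger text (for example a page of OCR). The function name, the window size, and the sample data are illustrative.

```python
def hit_with_snippets(annotation_id, full_text, start, end, window=30):
    """Build a hit whose snippets surround the annotation text full_text[start:end]."""
    return {
        "@type": "search:Hit",
        "annotations": [annotation_id],
        "before": full_text[max(0, start - window):start],  # text before the annotation
        "after": full_text[end:end + window],               # text after the annotation
    }


page = "There are two birds in the bush, and one bird in the hand."
start = page.index("birds")
hit = hit_with_snippets(
    "https://example.org/anno/1", page, start, start + len("birds"), window=10
)
```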
A POST endpoint that allows read access to Elasticsearch, possibly with optional formatting of the response (such as grouping, or formatting in a IIIF-compatible way).
Questions to consider:
Notes:
https://www.techempower.com/benchmarks/#section=data-r17&hw=ph&test=fortune&l=zijzen-1
https://fgimian.github.io/blog/2018/05/17/choosing-a-fast-python-api-framework/
https://iiif.io/api/search/1.0/#simple-lists
The simplest response looks exactly like a regular annotation list, where all of the matching annotations are returned in a single response.
The full annotation description must be included in the response, even if the annotations are separately dereferenceable via their URIs.
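A sketch of the simple list response, assuming the matching annotations have already been retrieved as full JSON objects. The function name and sample annotation are hypothetical, and the context URI is the Presentation 2 context as recalled from the specification examples, so it should be verified there.

```python
def annotation_list(request_url, annotations):
    """Wrap full annotation descriptions in a single annotation list response."""
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": request_url,
        "@type": "sc:AnnotationList",
        # Full descriptions are embedded, not just references by URI.
        "resources": [dict(a) for a in annotations],
    }


found = [
    {
        "@id": "https://example.org/anno/1",
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {"@type": "cnt:ContentAsText", "chars": "A bird in the hand"},
        "on": "https://example.org/canvas/1#xywh=100,100,250,20",
    }
]
response = annotation_list("https://example.org/search?q=bird", found)
```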
The service should support the indexing of W3C web annotations:
The Web Annotation Data Model is complex, so an MVP IIIF Content Search service will probably not support sophisticated granular queries into annotation content.
For MVP it should probably support:
The current Mathmos Elasticsearch index for Web Annotations looks like:
PUT /w3cannotation
{
  "mappings": {
    "annotations": {
      "properties": {
        "body": {
          "type": "text"
        },
        "bodyURI": {
          "type": "text",
          "analyzer": "whitespace"
        },
        "created": {
          "type": "date"
        },
        "creators": {
          "type": "text",
          "analyzer": "whitespace"
        },
        "generated": {
          "type": "date"
        },
        "generator": {
          "type": "text"
        },
        "id": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "manifest": {
          "type": "text",
          "analyzer": "whitespace"
        },
        "modified": {
          "type": "date"
        },
        "motivations": {
          "type": "text"
        },
        "oaJsonLd": {
          "type": "text",
          "index": false
        },
        "paredDownOaJsonLd": {
          "type": "text",
          "index": false
        },
        "suggest": {
          "type": "completion",
          "analyzer": "simple",
          "preserve_separators": true,
          "preserve_position_increments": true,
          "max_input_length": 100,
          "contexts": [
            {
              "name": "manifest",
              "type": "CATEGORY"
            }
          ]
        },
        "target": {
          "type": "text"
        },
        "targetURI": {
          "type": "text",
          "analyzer": "whitespace"
        },
        "uri": {
          "type": "text",
          "analyzer": "whitespace"
        },
        "w3cJsonLd": {
          "type": "text",
          "index": false
        },
        "xywh": {
          "type": "text"
        }
      }
    }
  }
}
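To show how this mapping might be queried, the sketch below builds an Elasticsearch request body from content search parameters. It only constructs the JSON body and does not call Elasticsearch. The function name and the choice of a bool query with match clauses are assumptions, not a description of how Mathmos queries the index.

```python
import json


def annotation_query(q, motivation=None, manifest=None, size=20):
    """Build a search body against the body, motivations and manifest fields."""
    # All query terms must appear in the annotation body.
    must = [{"match": {"body": {"query": q, "operator": "and"}}}]
    filters = []
    if motivation:
        filters.append({"match": {"motivations": motivation}})
    if manifest:
        # "manifest" uses the whitespace analyzer, so a URI stays one token.
        filters.append({"match": {"manifest": manifest}})
    return {"size": size, "query": {"bool": {"must": must, "filter": filters}}}


body = annotation_query(
    "bird hand", motivation="painting", manifest="https://example.org/manifest/1"
)
print(json.dumps(body, indent=2))
```

The body would be sent to the w3cannotation index's _search endpoint; the stored w3cJsonLd field could then be returned as the annotation description.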
Some way to spin up the search service and Elucidate together, with new annotations sent to Elucidate also being sent to the search service. This would allow quick local search services to be made available on smaller projects. It may also allow third-party content to be run through Montague, pointing at a dockerised Elucidate, and ingested in isolation.
Evaluate the new ES Annotated Text features to see if they offer any benefits around:
ES6.5 features in general.
https://www.elastic.co/guide/en/elasticsearch/plugins/current/mapper-annotated-text.html
https://www.elastic.co/guide/en/elasticsearch/plugins/current/mapper-annotated-text-usage.html
https://www.elastic.co/guide/en/elasticsearch/reference/6.x/release-notes-6.5.0.html#feature-6.5.0
Create a set of test fixtures and samples that can be used to validate the service.
The service should support CRUD events via the Iris message service.
The service should support the standard IIIF Content Search Parameters.
https://iiif.io/api/search/1.0/#query-parameters
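A sketch of parsing the standard parameters (q, motivation, date, user) from a request URL, treating each value as a space-separated list and collecting anything unrecognised so it can be reported as ignored. The function name and return shape are assumptions.

```python
from urllib.parse import parse_qs, urlsplit

SUPPORTED = ("q", "motivation", "date", "user")


def parse_search_params(url):
    """Return (params, ignored) for a content search request URL."""
    raw = parse_qs(urlsplit(url).query)
    # Each supported parameter may hold several space-separated values.
    params = {name: raw[name][0].split() for name in SUPPORTED if name in raw}
    ignored = sorted(set(raw) - set(SUPPORTED))
    return params, ignored


params, ignored = parse_search_params(
    "https://example.org/search?q=bird+hand&motivation=painting&foo=1"
)
```

Note that parse_qs decodes both + and %20 as spaces, so "bird+hand" yields two search terms.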
Index plain text from OCR, to provide:
The current Mathmos index in Elasticsearch looks like:
PUT /text_index
{
  "mappings": {
    "plaintext": {
      "properties": {
        "endPositionOfCurrentText": {
          "type": "integer",
          "index": false
        },
        "id": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "imageId": {
          "type": "keyword",
          "index": false
        },
        "manifestId": {
          "type": "keyword"
        },
        "nextCanvasId": {
          "type": "keyword",
          "index": false
        },
        "nextImageId": {
          "type": "keyword",
          "index": false
        },
        "plaintext": {
          "type": "text",
          "term_vector": "with_positions_offsets"
        },
        "suggest": {
          "type": "completion",
          "analyzer": "simple",
          "preserve_separators": true,
          "preserve_position_increments": true,
          "max_input_length": 100,
          "contexts": [
            {
              "name": "manifest",
              "type": "CATEGORY"
            }
          ]
        }
      }
    }
  }
}
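Because the plaintext field stores term vectors with positions and offsets, it can be used with Elasticsearch's fast vector highlighter. The sketch below builds such a request body without calling Elasticsearch; the function name and the fragment settings are assumptions, not the existing Mathmos query.

```python
import json


def plaintext_query(q, manifest_id, fragment_size=100):
    """Build a search body that matches OCR text and highlights the hits."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"plaintext": q}}],
                # manifestId is a keyword field, so an exact term filter applies.
                "filter": [{"term": {"manifestId": manifest_id}}],
            }
        },
        "highlight": {
            "fields": {
                "plaintext": {
                    "type": "fvh",  # relies on term_vector: with_positions_offsets
                    "fragment_size": fragment_size,
                    "number_of_fragments": 3,
                }
            }
        },
    }


body = plaintext_query("Navajo", "https://example.org/manifest/1")
print(json.dumps(body, indent=2))
```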