The dlcs-search-service from digirati-co-uk

A very basic UI that works with the different endpoints to allow quick operations against a configured Elucidate instance.

Potential features:

Bulk delete
Bulk update (creator, URL domain)
Search query to IIIF annotation list (to be attached manually)

IIIF Content Search: Target Resource Structure

https://iiif.io/api/search/1.0/#target-resource-structure

The annotations may also include references to the structure or structures that the target (the resource in the on property) is found within. The URI and type of the including resource must be given, and a label should be included.

For this service:

The manifest (for annotations on canvases) should be included using the within property.
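As a rough illustration of the within requirement, the Python sketch below wraps the canvas target of an annotation so that it also names its manifest. The function name add_within and the example URIs are hypothetical, and the key names used inside within (@id, @type, label) are an assumption that should be checked against the specification linked above.

```python
def add_within(annotation, manifest_uri, manifest_label):
    """Return a copy of the annotation whose target also references its manifest."""
    target = annotation["on"]
    # Accept either a bare canvas URI or an existing target object.
    target = {"@id": target} if isinstance(target, str) else dict(target)
    target["within"] = {
        "@id": manifest_uri,
        "@type": "sc:Manifest",  # assumed key name, see the note above
        "label": manifest_label,
    }
    enriched = dict(annotation)
    enriched["on"] = target
    return enriched


annotation = {
    "@id": "https://example.org/anno/1",
    "@type": "oa:Annotation",
    "on": "https://example.org/canvas/1#xywh=0,0,100,50",
}
enriched = add_within(annotation, "https://example.org/manifest/1", "Example manifest")
```

The original annotation is left untouched, so the stored copy and the search response can differ.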

Output: IIIF Change Discovery Activity Stream

Potentially, the list of IIIF resources that meet a particular search query could be output as:

https://preview.iiif.io/api/discovery-context/api/discovery/0.2/#iiif-change-discovery-api-0-2

Output the list of resources that meet the query as a Change Discovery set.
Hook into the central event register to produce a Change Discovery set with historic events for just the resources that meet the query.

Indexing: IIIF Presentation API Manifests - seeAlso

https://iiif.io/api/presentation/2.1/#seealso

The seeAlso provides a mechanism for linking to machine readable content, with a profile. Advanced search use cases can be delivered by indexing machine-readable metadata with semantics, via the seeAlso property of IIIF resources.

This is not an MVP feature.

Potentially, this might include such formats as:

Integration: non-Iris REST API

Basic RESTful API for CRUD operations on the annotation index.

add a new annotation to the index
update an existing annotation in the index
delete an existing annotation in the index
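The three operations can be sketched with an in-memory stand-in for the index. This is only an assumption about the semantics (add fails on duplicates, update and delete fail on unknown ids); the class name AnnotationIndex and its methods are hypothetical and say nothing about the eventual HTTP routes or the Elasticsearch calls behind them.

```python
class AnnotationIndex:
    """In-memory stand-in for the annotation index, keyed by annotation id."""

    def __init__(self):
        self._annotations = {}

    def add(self, annotation):
        anno_id = annotation["id"]
        if anno_id in self._annotations:
            raise KeyError(f"annotation already indexed: {anno_id}")
        self._annotations[anno_id] = annotation

    def update(self, annotation):
        anno_id = annotation["id"]
        if anno_id not in self._annotations:
            raise KeyError(f"annotation not indexed: {anno_id}")
        self._annotations[anno_id] = annotation

    def delete(self, anno_id):
        # Raises KeyError if the annotation is not in the index.
        del self._annotations[anno_id]

    def get(self, anno_id):
        return self._annotations.get(anno_id)


index = AnnotationIndex()
index.add({"id": "https://example.org/anno/1", "body": "first draft"})
index.update({"id": "https://example.org/anno/1", "body": "second draft"})
index.delete("https://example.org/anno/1")
```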

IIIF Content Search: Search Hit Highlighting

https://iiif.io/api/search/1.0/#search-term-highlighting

The client ... needs to know the text that caused the service to create the hit, and enough information about where it occurs in the content to reliably highlight it and not highlight non-matches. To do this, the service can supply text before and after the matching term within the content of the annotation, via an Open Annotation TextQuoteSelector object. TextQuoteSelectors have three properties: exact to record the exact text to look for, prefix with some text before the match, and suffix with some text after the match.

TextQuoteSelector: for search term highlighting
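A minimal sketch of how a hit with a TextQuoteSelector might be assembled, assuming the service already knows the character offsets of the match within the annotation text. The function name build_hit and the context_chars parameter are hypothetical; the exact, prefix and suffix properties are the ones described in the quoted specification text.

```python
def build_hit(annotation_id, text, start, end, context_chars=20):
    """Describe the match text[start:end] with some text either side of it."""
    return {
        "@type": "search:Hit",
        "annotations": [annotation_id],
        "selectors": [
            {
                "@type": "oa:TextQuoteSelector",
                "exact": text[start:end],
                # Clamp at zero so a match near the start gets a shorter prefix.
                "prefix": text[max(0, start - context_chars):start],
                "suffix": text[end:end + context_chars],
            }
        ],
    }


text = "There are two birds in the bush"
start = text.index("birds")
hit = build_hit("https://example.org/anno/1", text, start, start + len("birds"), context_chars=8)
```

The prefix and suffix let a client tell this occurrence of the term apart from others in the same annotation.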

Support different grouping

When querying the search service (however the query is expressed), it would be useful to be able to request a particular grouping of the results. This could be driven by different endpoints. For example:

Group by creator
Group by manifest (target)
Group by canvas (target)
Group by Entity (body)

Ideas of request responses in this case:

{
  "items": [
    {
      "id": "https://omeka.org/api/user/1", 
      "name": "Digirati",
      "annotations": {
        "id": "http://search-service.org/dereferencable-anno-list",
        "type": "AnnotationPage",
        "total": 123
      }
    }
  ]
}
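One way the "group by creator" response above could be produced is sketched below. It assumes each indexed annotation carries a creator object with id and name; the function name group_by_creator and the creator query parameter on the list URI are hypothetical.

```python
from collections import Counter
from urllib.parse import quote


def group_by_creator(annotations, list_base_uri):
    """Summarise annotations per creator in the response shape sketched above."""
    totals = Counter()
    names = {}
    for annotation in annotations:
        creator = annotation["creator"]
        totals[creator["id"]] += 1
        names[creator["id"]] = creator.get("name")
    return {
        "items": [
            {
                "id": creator_id,
                "name": names[creator_id],
                "annotations": {
                    # Hypothetical URI for the list of this creator's annotations.
                    "id": f"{list_base_uri}?creator={quote(creator_id, safe='')}",
                    "type": "AnnotationPage",
                    "total": total,
                },
            }
            for creator_id, total in totals.items()
        ]
    }
```

Grouping by manifest, canvas or entity would follow the same pattern with a different key.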

Consider access control concerns

Visibility of indexed objects to particular groups/roles (might be tags)

Is the existence of a result knowable, even if you can't see its body?

Consider in light of Ghent requirements

Indexing: IIIF Presentation API Manifests

Search use cases beyond IIIF Content Search "search within", for example:

"I want to see all the pre-20th century archival records that contain 'Navajo'"

or

"Which documents from this archival series have been tagged with 'Paris'?"

or

"Find 'John Smith' in records from 1929"

require indexing of IIIF Presentation API content, not just annotations.

Potentially index IIIF Presentation API content:

Output: IIIF Resources: Collections / Manifests

Potentially output a IIIF Collection of resources that match a particular query that can be loaded into a IIIF viewer.

Or potentially, a manifest that comprises just canvases that meet a particular set of search criteria.

IIIF Collection
IIIF Manifest
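As a sketch of the Collection output, the function below turns a list of search hits into a IIIF Presentation 2.1 Collection that references each matching manifest once. The hit shape (manifest and manifest_label keys) and the function name collection_from_hits are assumptions made for illustration.

```python
def collection_from_hits(collection_uri, label, hits):
    """Build a IIIF Collection listing each manifest that appears in the hits."""
    manifests = {}
    for hit in hits:
        # Keep the first label seen for each manifest; dicts preserve insertion order.
        manifests.setdefault(hit["manifest"], hit["manifest_label"])
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": collection_uri,
        "@type": "sc:Collection",
        "label": label,
        "manifests": [
            {"@id": uri, "@type": "sc:Manifest", "label": manifest_label}
            for uri, manifest_label in manifests.items()
        ],
    }
```

A manifest made up of only the matching canvases would need the canvases themselves to be indexed or fetched, which this sketch does not attempt.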

Modularity: minimal startup

The system should be extremely modular, and a "core" minimal deployment should be possible with just support for IIIF Content Search features alone.

Stateless start-up

When spun up with a configured Elucidate (or S3), the service should be able to bring itself to a working state from scratch using bulk ingest. This should also allow it to scale horizontally.

Indexing: IIIF Presentation API features of annotations - temporal

Needs refinement.

How should we think about temporal ranges? That is, should it be possible to query for annotations on a particular time segment of time-based media, or is this a client-side concern?

Indexing: Editorial/contextual content

IIIF Presentation API objects and annotations typically have a context, for example:

A curated exhibition
A scholarly platform with descriptive content
An institutional repository

Some search use cases, for example search across some context of discovery (platform, exhibition, etc.), may require indexing associated content such as associated HTML.

Ocracoke and Stanford Content Search Service Evaluation: Evaluation Spike

Evaluate for quality, reusability, inspiration:

Consider relationship between coordinate service and search index.

DLCS approach: separate coordinate and search services
Others: combined coordinate and text index.

Non-functional requirement: support via Unicode for non-Roman scripts and languages

The service must support non-Western languages and scripts, including, but not limited to:

Arabic
Chinese
Cyrillic
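Part of this requirement is making sure that equivalent Unicode spellings of a term compare equal at index and query time. The sketch below shows one possible approach using NFC normalisation plus case folding from the Python standard library; the function name normalise_term is hypothetical, and in practice much of this work may be delegated to the Elasticsearch analyzers instead.

```python
import unicodedata


def normalise_term(term):
    """Normalise a term so that equivalent Unicode spellings compare equal."""
    # NFC composes base characters and combining marks; casefold() is a
    # script-aware lower-casing that leaves caseless scripts unchanged.
    return unicodedata.normalize("NFC", term).casefold()


# "e" followed by a combining acute accent matches the precomposed character.
same = normalise_term("Cafe\u0301") == normalise_term("caf\u00e9")
```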

IIIF Content Search: Autocomplete Service

https://iiif.io/api/search/1.0/#autocomplete

The autocomplete service returns terms that can be added into the q parameter of the related search service, given the first characters of the term.

The service should support all of the parameters for the search query, see: #6

plus the additional min parameter: https://iiif.io/api/search/1.0/#query-parameters-1
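To make the behaviour concrete, here is a small sketch of an autocomplete response built from an in-memory table of term counts. The term_counts input, the function name autocomplete and the example URLs are hypothetical; min_count stands in for the min parameter, and the TermList shape follows the specification linked above.

```python
from urllib.parse import urlencode


def autocomplete(term_counts, prefix, search_uri, min_count=1):
    """Return the indexed terms that start with prefix and occur often enough."""
    prefix = prefix.lower()
    terms = [
        {
            "match": term,
            "count": count,
            # Link back to the related search service with the term as q.
            "url": f"{search_uri}?{urlencode({'q': term})}",
        }
        for term, count in sorted(term_counts.items())
        if term.lower().startswith(prefix) and count >= min_count
    ]
    return {
        "@context": "http://iiif.io/api/search/1/context.json",
        "@type": "search:TermList",
        "terms": terms,
    }


counts = {"bird": 15, "biro": 3, "birth": 9, "bush": 4}
term_list = autocomplete(counts, "bir", "https://example.org/search", min_count=5)
```

In the real service the counts would more likely come from the completion suggester in the index than from a Python dictionary.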

Integration: support for containerised message bus

See: #3

DLCS use cases assume the use of Iris as a message bus with a dependency on AWS SNS and AWS SQS.

For local deployment, testing, and potential containerised deployments, the service should support:

Indexing: W3C Web Annotations - Annotation Studio Drafts

Is it worth considering indexing the Annotation Studio draft annotations?

We have existing code, produced for NLW, that can parse these and convert them into something indexable.

Benefits:

We can build in discovery / browse / search while retaining editable annotations
No requirement to bulk convert annotations at project end
Annotation lists produced via queries (against a IIIF resource: canvas, manifest, range, etc) which return OA or vanilla W3C web annotations can be used as the dissemination copy.

IIIF Content Search: Snippets

https://iiif.io/api/search/1.0/#search-term-snippets

The simplest addition to the hit object is to add text that appears before and after the matching text in the annotation.

The service may add a before property to the hit with some amount of text that appears before the content of the annotation (given in chars), and may also add an after property with some amount of text that appears after the content of the annotation.

OCR: from Starsky (or another source), add before and after snippets.
Textual annotations: add before and after snippets.
- commenting
- transcribing
- translating
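For OCR, where each word is typically its own annotation, before and after can be taken from the neighbouring words. The sketch below assumes the words of a canvas are available in reading order as dictionaries with id and text keys; that input shape, the function name snippet_hit and the window parameter are all hypothetical.

```python
def snippet_hit(words, index, window=3):
    """Build a hit for words[index] with the surrounding words as snippets."""
    before = " ".join(w["text"] for w in words[max(0, index - window):index])
    after = " ".join(w["text"] for w in words[index + 1:index + 1 + window])
    return {
        "@type": "search:Hit",
        "annotations": [words[index]["id"]],
        "before": before,
        "after": after,
    }


words = [
    {"id": f"https://example.org/anno/{i}", "text": text}
    for i, text in enumerate("There are two birds in the bush".split())
]
hit = snippet_hit(words, 3, window=2)
```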

Raw access to query language

A POST endpoint that allows read access to the Elasticsearch query language, possibly with optional formatting of the response (such as grouping, or formatting in a IIIF-compatible way).

Performance: Spike

Questions to consider:

Notes:

https://www.techempower.com/benchmarks/#section=data-r17&hw=ph&test=fortune&l=zijzen-1
https://fgimian.github.io/blog/2018/05/17/choosing-a-fast-python-api-framework/

IIIF Content Search: Simple Annotation List response

https://iiif.io/api/search/1.0/#simple-lists

The simplest response looks exactly like a regular annotation list, where all of the matching annotations are returned in a single response.

The full annotation description must be included in the response, even if the annotations are separately dereferenceable via their URIs.
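A sketch of the simple response, assuming the matching annotations have already been retrieved as full JSON objects. The function name simple_annotation_list is hypothetical; the check on each entry reflects the requirement that full descriptions, not bare URIs, are returned.

```python
def simple_annotation_list(request_uri, annotations):
    """Wrap all matching annotations in a single annotation list response."""
    resources = list(annotations)
    for annotation in resources:
        if not isinstance(annotation, dict):
            raise TypeError("each result must be a full annotation description, not a URI")
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": request_uri,
        "@type": "sc:AnnotationList",
        "resources": resources,
    }
```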

Indexing: W3C Web Annotations

The service should support the indexing of W3C web annotations:

The Web Annotation Data Model is complex, so an MVP IIIF Content Search service will probably not support sophisticated granular queries into annotation content.

For MVP it should probably support:

The current Mathmos Elasticsearch index for Web Annotations looks like:

PUT /w3cannotation
{
    "mappings": {
      "annotations": {
        "properties": {
          "body": {
            "type": "text"
          },
          "bodyURI": {
            "type": "text",
            "analyzer": "whitespace"
          },
          "created": {
            "type": "date"
          },
          "creators": {
            "type": "text",
            "analyzer": "whitespace"
          },
          "generated": {
            "type": "date"
          },
          "generator": {
            "type": "text"
          },
          "id": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "manifest": {
            "type": "text",
            "analyzer": "whitespace"
          },
          "modified": {
            "type": "date"
          },
          "motivations": {
            "type": "text"
          },
          "oaJsonLd": {
            "type": "text",
            "index": false
          },
          "paredDownOaJsonLd": {
            "type": "text",
            "index": false
          },
          "suggest": {
            "type": "completion",
            "analyzer": "simple",
            "preserve_separators": true,
            "preserve_position_increments": true,
            "max_input_length": 100,
            "contexts": [
              {
                "name": "manifest",
                "type": "CATEGORY"
              }
            ]
          },
          "target": {
            "type": "text"
          },
          "targetURI": {
            "type": "text",
            "analyzer": "whitespace"
          },
          "uri": {
            "type": "text",
            "analyzer": "whitespace"
          },
          "w3cJsonLd": {
            "type": "text",
            "index": false
          },
          "xywh": {
            "type": "text"
          }
        }
      }
    }
}
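As a sketch of how the Content Search parameters might be translated into a query against this mapping, the function below builds an Elasticsearch request body as a plain dictionary. The pairing of parameters with fields (q with body, motivation with motivations, user with creators, date with created) is an assumption based on the field names, and the function name build_annotation_query is hypothetical.

```python
def build_annotation_query(q=None, motivation=None, user=None, date_range=None):
    """Build an Elasticsearch bool query body from Content Search parameters."""
    must, filters = [], []
    if q:
        must.append({"match": {"body": q}})
    if motivation:
        filters.append({"match": {"motivations": motivation}})
    if user:
        # creators uses the whitespace analyzer, so URIs stay intact as tokens.
        filters.append({"match": {"creators": user}})
    if date_range:
        start, end = date_range
        filters.append({"range": {"created": {"gte": start, "lte": end}}})
    if not must:
        must.append({"match_all": {}})
    return {"query": {"bool": {"must": must, "filter": filters}}}


body = build_annotation_query(
    q="bird",
    motivation="painting",
    date_range=("2015-01-01T00:00:00Z", "2015-12-31T23:59:59Z"),
)
```

The resulting dictionary would be sent as the JSON body of a search request on the w3cannotation index.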

Dockerised version - with Elucidate

Some way to spin up the search service and Elucidate together, with new annotations sent to Elucidate also being sent to the search service. This would allow quick local search services to be made available on smaller projects. It may also allow third-party content to be run through Montague, pointing at a dockerised Elucidate, so that it can be ingested in isolation.

Elasticsearch Annotated Text: Spike

Evaluate the new ES Annotated Text features to see if they offer any benefits around:

Functionality
Performance

ES6.5 features in general.

https://www.elastic.co/guide/en/elasticsearch/plugins/current/mapper-annotated-text.html
https://www.elastic.co/guide/en/elasticsearch/plugins/current/mapper-annotated-text-usage.html

https://www.elastic.co/guide/en/elasticsearch/reference/6.x/release-notes-6.5.0.html#feature-6.5.0

IIIF Content Search: Paged Annotation Lists

https://iiif.io/api/search/1.0/#paging-results

Paging follows: https://iiif.io/api/presentation/2.1/#paging
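A minimal sketch of a paged response, assuming the full result set is available as a list and that pages are addressed with a page query parameter. That parameter, the function names and the default page size are hypothetical; the within, startIndex, next and prev properties follow the paging pattern in the linked specifications.

```python
from urllib.parse import urlencode


def page_uri(search_uri, q, page):
    # The "page" query parameter is a hypothetical choice for this sketch.
    return f"{search_uri}?{urlencode({'q': q, 'page': page})}"


def paged_annotation_list(search_uri, q, annotations, page, page_size=10):
    """Return one page of results, linked to the first, last and adjacent pages."""
    total = len(annotations)
    last_page = max(1, -(-total // page_size))  # ceiling division
    start = (page - 1) * page_size
    response = {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": page_uri(search_uri, q, page),
        "@type": "sc:AnnotationList",
        "within": {
            "@type": "sc:Layer",
            "total": total,
            "first": page_uri(search_uri, q, 1),
            "last": page_uri(search_uri, q, last_page),
        },
        "startIndex": start,
        "resources": annotations[start:start + page_size],
    }
    if page > 1:
        response["prev"] = page_uri(search_uri, q, page - 1)
    if page < last_page:
        response["next"] = page_uri(search_uri, q, page + 1)
    return response
```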

Fixtures: sample annotations, full text

Create a set of test fixtures and samples that can be used to validate the service.

Full text content from OCR service, including non-Roman script.
Full text content from transcription, including non-Roman script.
Tagging annotations (IDA model)
Tagging annotations (RS model)
Draft format annotations (Anno studio: NLW models)
W3C web annotations (serialised non-draft format from Anno Studio)
Manifests and canvases for the above

Integration: Iris message bus

The service should support CRUD events via the Iris message service.

index new annotation
update existing annotation
delete existing annotation

IIIF Content Search: Query parameters

The service should support the standard IIIF Content Search Parameters.

https://iiif.io/api/search/1.0/#query-parameters

q: A space separated list of search terms.
motivation: A space separated list of motivation terms.
user: A space separated list of URIs that are the identities of users.
date: A space separated list of date ranges. In ISO8601 format, YYYY-MM-DDThh:mm:ssZ/YYYY-MM-DDThh:mm:ssZ.
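The four parameters could be parsed along these lines, using only the Python standard library. The function names are hypothetical, and the sketch assumes every date range has both a start and an end in the exact format given above.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def parse_timestamp(value):
    """Parse a YYYY-MM-DDThh:mm:ssZ timestamp as a UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def parse_search_params(query_string):
    """Split each Content Search parameter into its space-separated values."""
    raw = parse_qs(query_string)

    def values(name):
        # Each parameter holds a single space-separated list.
        return raw.get(name, [""])[0].split()

    date_ranges = []
    for date_range in values("date"):
        start, end = date_range.split("/")
        date_ranges.append((parse_timestamp(start), parse_timestamp(end)))

    return {
        "q": values("q"),
        "motivation": values("motivation"),
        "user": values("user"),
        "date": date_ranges,
    }


params = parse_search_params(
    "q=bird+bush&motivation=painting&date=2015-01-01T00:00:00Z/2015-12-31T23:59:59Z"
)
```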

Indexing: OCR Text

Index plain text from OCR, to provide:

#10
#12
#11
#6

The current Mathmos index in Elasticsearch looks like:

PUT /text_index
{
    "mappings": {
      "plaintext": {
        "properties": {
          "endPositionOfCurrentText": {
            "type": "integer",
            "index": false
          },
          "id": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "imageId": {
            "type": "keyword",
            "index": false
          },
          "manifestId": {
            "type": "keyword"
          },
          "nextCanvasId": {
            "type": "keyword",
            "index": false
          },
          "nextImageId": {
            "type": "keyword",
            "index": false
          },
          "plaintext": {
            "type": "text",
            "term_vector": "with_positions_offsets"
          },
          "suggest": {
            "type": "completion",
            "analyzer": "simple",
            "preserve_separators": true,
            "preserve_position_increments": true,
            "max_input_length": 100,
            "contexts": [
              {
                "name": "manifest",
                "type": "CATEGORY"
              }
            ]
          }
        }
      }
    }
}
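Because plaintext is stored with term vectors (with_positions_offsets), a full-text query against this index can ask Elasticsearch for highlighted fragments using the fast vector highlighter. The sketch below builds such a request body as a plain dictionary; the function name build_text_query and the fragment_size default are hypothetical, and restricting by manifestId is an assumption about how "search within" would be scoped.

```python
def build_text_query(q, manifest_id=None, fragment_size=100):
    """Build a search body that matches OCR text and requests highlights."""
    bool_query = {"must": [{"match": {"plaintext": q}}]}
    if manifest_id:
        # manifestId is a keyword field, so an exact term filter applies.
        bool_query["filter"] = [{"term": {"manifestId": manifest_id}}]
    return {
        "query": {"bool": bool_query},
        "highlight": {
            "fields": {
                # "fvh" relies on the term vectors declared in the mapping.
                "plaintext": {"type": "fvh", "fragment_size": fragment_size}
            }
        },
    }


body = build_text_query("Navajo", manifest_id="https://example.org/manifest/1")
```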
