dlcs-search-service's People

Contributors

mattmcgrattan

dlcs-search-service's Issues

Basic UI

A very basic UI that sits over the different endpoints and allows quick operations against a configured Elucidate instance.

Potential features:

  • Bulk delete
  • Bulk update (creator, URL domain)
  • Search query to IIIF annotation list (to be attached manually)

Indexing: IIIF Presentation API Manifests - seeAlso

https://iiif.io/api/presentation/2.1/#seealso

The seeAlso property provides a mechanism for linking to machine-readable content, with a profile. Advanced search use cases can be delivered by indexing machine-readable metadata with semantics, reached via the seeAlso property of IIIF resources.

This is not an MVP feature.

  • Manifest / Sequence / Range / Canvas:
    • seeAlso:
      • identify profile
      • identify appropriate schema
      • index content
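As a rough illustration of the "identify profile" step, the sketch below walks a IIIF Presentation 2.1 manifest and collects every seeAlso link with its format and profile. The function name and the shape of the returned records are assumptions for illustration; choosing a schema and indexing the linked content are left out.

```python
from typing import Any, Dict, Iterator


def iter_see_also(resource: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one record per seeAlso link on a resource and its nested children."""
    see_also = resource.get("seeAlso", [])
    # In Presentation 2.1 seeAlso may be a URI string, an object, or a list of either.
    if not isinstance(see_also, list):
        see_also = [see_also]
    for entry in see_also:
        if isinstance(entry, str):
            entry = {"@id": entry}
        yield {
            "source": resource.get("@id"),    # the resource carrying the link
            "url": entry.get("@id"),
            "format": entry.get("format"),
            "profile": entry.get("profile"),  # used later to pick a parser
        }
    # Descend into sequences, canvases and ranges ("structures").
    for key in ("sequences", "canvases", "structures"):
        for child in resource.get(key, []):
            # Ranges list their canvases as URI strings, so skip non-objects.
            if isinstance(child, dict):
                yield from iter_see_also(child)
```

Each record keeps the identifier of the resource that carried the link, so indexed content can later be tied back to a manifest, range, or canvas.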

Potentially, this might include such formats as:

  • Dublin Core
  • Schema.org
  • MODS
  • Marc21
  • Bibframe
  • CIDOC-CRM
  • EAD
  • TEI-XML
  • plain text files.

Integration: non-Iris REST API

Basic RESTful API for CRUD operations on the annotation index.

  • add a new annotation to the index
  • update an existing annotation in the index
  • delete an existing annotation in the index

IIIF Content Search: Search Hit Highlighting

https://iiif.io/api/search/1.0/#search-term-highlighting

The client ... needs to know the text that caused the service to create the hit, and enough information about where it occurs in the content to reliably highlight it and not highlight non-matches. To do this, the service can supply text before and after the matching term within the content of the annotation, via an Open Annotation TextQuoteSelector object. TextQuoteSelectors have three properties: exact to record the exact text to look for, prefix with some text before the match, and suffix with some text after the match.

  • TextQuoteSelector: for search term highlighting
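A minimal sketch of building such a selector, assuming the service already knows the character offsets of the match within the annotation text. The function name and the `context` parameter are illustrative, not part of the specification.

```python
from typing import Dict


def text_quote_selector(text: str, start: int, end: int, context: int = 20) -> Dict[str, str]:
    """Build a TextQuoteSelector for the match at text[start:end]."""
    return {
        "@type": "oa:TextQuoteSelector",
        "exact": text[start:end],                          # the matched text
        "prefix": text[max(0, start - context):start],     # text before the match
        "suffix": text[end:end + context],                 # text after the match
    }
```

For example, `text_quote_selector("The quick brown fox", 4, 9, context=4)` gives an `exact` of "quick" with "The " as prefix and " bro" as suffix.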

Support different grouping

When you query the search service (however that ends up working), it would be nice to be able to request a particular grouping of the results. This could be driven by different endpoints. For example:

  • Group by creator
  • Group by manifest (target)
  • Group by canvas (target)
  • Group by Entity (body)

Ideas of request responses in this case:

{
  "items": [
    {
      "id": "https://omeka.org/api/user/1", 
      "name": "Digirati",
      "annotations": {
        "id": "http://search-service.org/dereferencable-anno-list",
        "type": "AnnotationPage",
        "total": 123
      }
    }
  ]
}
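One way the "group by creator" response could be assembled is sketched below. It assumes annotations are W3C-style dicts with a `creator` that is either a URI string or an object with an `id`, and that `page_url` is a hypothetical callback returning the dereferenceable annotation list URL for a creator. The `name` field is left out because it would need a separate user lookup.

```python
from collections import Counter
from typing import Any, Callable, Dict, Iterable, Optional


def creator_id(anno: Dict[str, Any]) -> Optional[str]:
    """Return the creator URI whether it is given as a string or an object."""
    creator = anno.get("creator")
    if isinstance(creator, dict):
        return creator.get("id")
    return creator


def group_by_creator(annotations: Iterable[Dict[str, Any]],
                     page_url: Callable[[str], str]) -> Dict[str, Any]:
    """Count annotations per creator and shape them like the response above."""
    counts = Counter(creator_id(a) for a in annotations)
    return {
        "items": [
            {
                "id": cid,
                "annotations": {
                    "id": page_url(cid),       # hypothetical URL scheme
                    "type": "AnnotationPage",
                    "total": total,
                },
            }
            for cid, total in counts.items()
            if cid is not None                 # skip annotations with no creator
        ]
    }
```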

Consider access control concerns

Visibility of indexed objects to particular groups/roles (might be tags)

Is the existence of a result knowable, even if you can't see its body?

Consider in light of Ghent requirements

Indexing: IIIF Presentation API Manifests

Search use cases beyond IIIF Content Search "search within", for example:

"I want to see all the pre-20th century archival records that contain 'Navajo'"

or

"Which documents from this archival series have been tagged with 'Paris'?"

or

"Find 'John Smith' in records from 1929"

require indexing of IIIF Presentation API content, not just annotations.

Potentially index IIIF Presentation API content:

  • Manifest / Sequence / Range / Canvas:
    • label
    • description
    • metadata fields (treated as strings, not as structured data with semantics)
    • attribution
  • navDate
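The fields listed above could be flattened into an index document along these lines. This is a sketch for a Presentation 2.1 manifest only; the output field names are assumptions, and metadata pairs are kept as plain strings, as the issue suggests.

```python
from typing import Any, Dict


def as_text(value: Any) -> str:
    """Flatten a IIIF 2.1 string, language-tagged value, or list of either."""
    if value is None:
        return ""
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        return as_text(value.get("@value"))
    if isinstance(value, list):
        return " ".join(t for t in (as_text(v) for v in value) if t)
    return str(value)


def manifest_document(manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the searchable descriptive fields from a manifest."""
    return {
        "id": manifest.get("@id"),
        "label": as_text(manifest.get("label")),
        "description": as_text(manifest.get("description")),
        "attribution": as_text(manifest.get("attribution")),
        # Each metadata pair becomes a "label: value" string, no semantics.
        "metadata": [
            f"{as_text(m.get('label'))}: {as_text(m.get('value'))}"
            for m in manifest.get("metadata", [])
        ],
        "navDate": manifest.get("navDate"),
    }
```

The same extraction could be applied to sequences, ranges, and canvases, since they share these descriptive properties.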

Output: IIIF Resources: Collections / Manifests

Potentially output a IIIF Collection of resources that match a particular query that can be loaded into a IIIF viewer.

Or potentially, a manifest that comprises just canvases that meet a particular set of search criteria.

  • IIIF Collection
  • IIIF Manifest
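A minimal sketch of the Collection output, assuming search results arrive as dicts with `id` and `label` keys (a hypothetical shape) and that Presentation 2.1 is the target version:

```python
from typing import Any, Dict, Iterable


def build_collection(collection_id: str, label: str,
                     manifests: Iterable[Dict[str, str]]) -> Dict[str, Any]:
    """Wrap matching manifests in a IIIF Presentation 2.1 Collection."""
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": collection_id,
        "@type": "sc:Collection",
        "label": label,
        "manifests": [
            {"@id": m["id"], "@type": "sc:Manifest", "label": m["label"]}
            for m in manifests
        ],
    }
```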

Modularity: minimal startup

The system should be extremely modular, and a "core" minimal deployment should be possible with just support for IIIF Content Search features alone.

Stateless start-up

When spun up against a configured Elucidate (or S3), the service should be able to bring itself to a working state from scratch using bulk ingest. This should also allow it to scale horizontally.

Indexing: Editorial/contextual content

IIIF Presentation API objects and annotations typically have a context, for example:

  • A curated exhibition
  • A scholarly platform with descriptive content
  • An institutional repository

Some search use cases, such as searching across a whole context of discovery (a platform, an exhibition, etc.), may require indexing associated content such as HTML pages.

Integration: support for containerised message bus

See: #3

DLCS use cases assume the use of Iris as a message bus with a dependency on AWS SNS and AWS SQS.

For local deployment, testing, and potential containerised deployments, the service should support:

  • an alternative to Iris as the message bus, using one of:
    • Mock of AWS services, or
    • RabbitMQ, or
    • Celery, or
    • ActiveMQ, or ...
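One way to keep the service independent of the bus is a small interface with swappable backends. The sketch below is an assumption about how that might look, not the Iris API; it includes only an in-memory backend suitable for local tests, and an SQS, RabbitMQ, or ActiveMQ backend would implement the same two methods.

```python
import queue
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class MessageBus(ABC):
    """Hypothetical minimal interface the search service would depend on."""

    @abstractmethod
    def publish(self, message: Dict[str, Any]) -> None:
        """Send a message to the bus."""

    @abstractmethod
    def receive(self) -> Optional[Dict[str, Any]]:
        """Return the next message, or None if nothing is waiting."""


class InMemoryBus(MessageBus):
    """Local stand-in for testing without AWS or a broker."""

    def __init__(self) -> None:
        self._queue: "queue.Queue[Dict[str, Any]]" = queue.Queue()

    def publish(self, message: Dict[str, Any]) -> None:
        self._queue.put(message)

    def receive(self) -> Optional[Dict[str, Any]]:
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None
```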

Indexing: W3C Web Annotations - Annotation Studio Drafts

Is it worth considering indexing the Annotation Studio draft annotations?

We have existing code, produced for NLW, that can parse these and convert them into something indexable.

Benefits:

  • We can build in discovery / browse / search while retaining editable annotations
  • No requirement to bulk convert annotations at project end
  • Annotation lists produced via queries (against a IIIF resource: canvas, manifest, range, etc) which return OA or vanilla W3C web annotations can be used as the dissemination copy.

IIIF Content Search: Snippets

https://iiif.io/api/search/1.0/#search-term-snippets

The simplest addition to the hit object is to add text that appears before and after the matching text in the annotation.

The service may add a before property to the hit with some amount of text that appears before the content of the annotation (given in chars), and may also add an after property with some amount of text that appears after the content of the annotation.

  • ocr: from Starsky (or other source), add before and after snippets.
  • textual annotations: add before and after snippets.
    • commenting
    • transcribing
    • translating
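A sketch of a hit with before and after snippets, assuming the service holds the surrounding text (for example a page of OCR) and knows the character offsets of the annotation's content within it. The function name and the `chars` parameter are illustrative.

```python
from typing import Any, Dict


def build_hit(annotation_id: str, text: str, start: int, end: int,
              chars: int = 30) -> Dict[str, Any]:
    """Build a search:Hit whose annotation content is text[start:end]."""
    return {
        "@type": "search:Hit",
        "annotations": [annotation_id],
        "before": text[max(0, start - chars):start],  # up to `chars` characters before
        "after": text[end:end + chars],               # up to `chars` characters after
    }
```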

Raw access to query language

A POST endpoint that allows read-only access to the Elasticsearch query language, possibly with optional formatting of the response (such as grouping, or output in a IIIF-compatible form).

Performance: Spike

Questions to consider:

Notes:

https://www.techempower.com/benchmarks/#section=data-r17&hw=ph&test=fortune&l=zijzen-1
https://fgimian.github.io/blog/2018/05/17/choosing-a-fast-python-api-framework/

Indexing: W3C Web Annotations

The service should support the indexing of W3C web annotations:

The Web Annotation Data Model is complex, so an MVP IIIF Content Search service will probably not support sophisticated granular queries into annotation content.

For MVP it should probably support:

The current Mathmos Elasticsearch index for Web Annotations looks like:

PUT /w3cannotation
{
    "mappings": {
      "annotations": {
        "properties": {
          "body": {
            "type": "text"
          },
          "bodyURI": {
            "type": "text",
            "analyzer": "whitespace"
          },
          "created": {
            "type": "date"
          },
          "creators": {
            "type": "text",
            "analyzer": "whitespace"
          },
          "generated": {
            "type": "date"
          },
          "generator": {
            "type": "text"
          },
          "id": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "manifest": {
            "type": "text",
            "analyzer": "whitespace"
          },
          "modified": {
            "type": "date"
          },
          "motivations": {
            "type": "text"
          },
          "oaJsonLd": {
            "type": "text",
            "index": false
          },
          "paredDownOaJsonLd": {
            "type": "text",
            "index": false
          },
          "suggest": {
            "type": "completion",
            "analyzer": "simple",
            "preserve_separators": true,
            "preserve_position_increments": true,
            "max_input_length": 100,
            "contexts": [
              {
                "name": "manifest",
                "type": "CATEGORY"
              }
            ]
          },
          "target": {
            "type": "text"
          },
          "targetURI": {
            "type": "text",
            "analyzer": "whitespace"
          },
          "uri": {
            "type": "text",
            "analyzer": "whitespace"
          },
          "w3cJsonLd": {
            "type": "text",
            "index": false
          },
          "xywh": {
            "type": "text"
          }
        }
      }
    }
}
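To show how a W3C annotation might be flattened into a document for this mapping, here is a hedged sketch. The meaning of each field is inferred from its name, not taken from the Mathmos code, and only a subset of the fields is populated.

```python
import json
from typing import Any, Dict, List, Optional


def as_list(value: Any) -> List[Any]:
    """Wrap a single value in a list; None becomes an empty list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def resource_uri(resource: Any) -> Optional[str]:
    """Return the URI of a body, target or agent given as a string or object."""
    if isinstance(resource, str):
        return resource
    source = resource.get("source", resource.get("id"))
    if isinstance(source, dict):
        return source.get("id")
    return source


def to_index_document(anno: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten an annotation into whitespace-separated fields (assumed semantics)."""
    bodies = as_list(anno.get("body"))
    target_uris = [u for u in map(resource_uri, as_list(anno.get("target"))) if u]
    return {
        "id": anno.get("id"),
        "body": " ".join(b["value"] for b in bodies
                         if isinstance(b, dict) and "value" in b),
        "bodyURI": " ".join(u for u in map(resource_uri, bodies) if u),
        "created": anno.get("created"),
        "creators": " ".join(u for u in map(resource_uri, as_list(anno.get("creator"))) if u),
        "motivations": " ".join(as_list(anno.get("motivation"))),
        "targetURI": " ".join(target_uris),
        # Keep only the region from media fragments such as ...#xywh=1,2,3,4
        "xywh": " ".join(u.split("#xywh=", 1)[1] for u in target_uris if "#xywh=" in u),
        "w3cJsonLd": json.dumps(anno),  # stored but not indexed in the mapping
    }
```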

Dockerised version - with elucidate

Some way to spin up the search service and Elucidate together, so that new annotations sent to Elucidate are also sent to the search service. This would make quick local search services available on smaller projects. It may also allow third-party content to be run through Montague, pointed at a dockerised Elucidate, and ingested in isolation.

Fixtures: sample annotations, full text

Create a set of test fixtures and samples that can be used to validate the service.

  • Full text content from OCR service, including non-Roman script.
  • Full text content from transcription, including non-Roman script.
  • Tagging annotations (IDA model)
  • Tagging annotations (RS model)
  • Draft format annotations (Anno studio: NLW models)
  • W3C web annotations (serialised non-draft format from Anno Studio)
  • Manifests and canvases for the above

Integration: Iris message bus

The service should support CRUD events via the Iris message service.

  • index new annotation
  • update existing annotation
  • delete existing annotation
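The three events could be dispatched to index operations roughly as follows. The message shape (`event` and `annotation` keys) is a hypothetical placeholder, not the actual Iris message format, and a dict stands in for the real index.

```python
from typing import Any, Dict


class AnnotationIndex:
    """In-memory stand-in for the index, keyed by annotation id."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, Any]] = {}

    def handle(self, message: Dict[str, Any]) -> None:
        """Apply one create, update or delete event to the index."""
        event = message["event"]          # assumed key names
        anno = message["annotation"]
        if event in ("create", "update"):
            # Indexing is idempotent: an update overwrites the stored document.
            self.docs[anno["id"]] = anno
        elif event == "delete":
            self.docs.pop(anno["id"], None)
        else:
            raise ValueError(f"Unsupported event: {event}")
```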

IIIF Content Search: Query parameters

The service should support the standard IIIF Content Search Parameters.

https://iiif.io/api/search/1.0/#query-parameters

  • q: A space separated list of search terms.
  • motivation: A space separated list of motivation terms.
  • user: A space separated list of URIs that are the identities of users.
  • date: A space separated list of date ranges. In ISO8601 format, YYYY-MM-DDThh:mm:ssZ/YYYY-MM-DDThh:mm:ssZ.
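A small sketch of parsing these four parameters from an already-decoded query string. The function name and the returned structure are illustrative assumptions.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Mapping, Tuple

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_instant(value: str) -> datetime:
    """Parse YYYY-MM-DDThh:mm:ssZ into a UTC datetime."""
    return datetime.strptime(value, ISO_FORMAT).replace(tzinfo=timezone.utc)


def parse_search_params(params: Mapping[str, str]) -> Dict[str, Any]:
    """Split each space-separated parameter into a list of terms."""

    def terms(name: str) -> List[str]:
        return params.get(name, "").split()

    ranges: List[Tuple[datetime, datetime]] = []
    for item in terms("date"):
        # Raises ValueError if a range is not exactly "start/end".
        start, end = item.split("/")
        ranges.append((parse_instant(start), parse_instant(end)))

    return {
        "q": terms("q"),
        "motivation": terms("motivation"),
        "user": terms("user"),
        "date": ranges,
    }
```

A missing parameter comes back as an empty list, so callers can treat "not supplied" and "no terms" the same way.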

Indexing: OCR Text

Index plain text from OCR, to provide:

#10
#12
#11
#6

The current Mathmos index in Elasticsearch looks like:

PUT /text_index
{
    "mappings": {
      "plaintext": {
        "properties": {
          "endPositionOfCurrentText": {
            "type": "integer",
            "index": false
          },
          "id": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "imageId": {
            "type": "keyword",
            "index": false
          },
          "manifestId": {
            "type": "keyword"
          },
          "nextCanvasId": {
            "type": "keyword",
            "index": false
          },
          "nextImageId": {
            "type": "keyword",
            "index": false
          },
          "plaintext": {
            "type": "text",
            "term_vector": "with_positions_offsets"
          },
          "suggest": {
            "type": "completion",
            "analyzer": "simple",
            "preserve_separators": true,
            "preserve_position_increments": true,
            "max_input_length": 100,
            "contexts": [
              {
                "name": "manifest",
                "type": "CATEGORY"
              }
            ]
          }
        }
      }
    }
}
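Because `plaintext` is stored with `with_positions_offsets` term vectors, it can be queried with Elasticsearch's fast vector highlighter (`fvh`). The sketch below only builds a request body; the function name, the parameters, and the choice to filter on `manifestId` are illustrative assumptions.

```python
from typing import Any, Dict, Optional


def plaintext_query(q: str, manifest_id: Optional[str] = None,
                    fragment_size: int = 100) -> Dict[str, Any]:
    """Build a search body for text_index with highlighted fragments."""
    body: Dict[str, Any] = {
        "query": {"bool": {"must": [{"match": {"plaintext": q}}]}},
        "highlight": {
            "fields": {
                # fvh relies on the term vectors declared in the mapping.
                "plaintext": {"type": "fvh", "fragment_size": fragment_size}
            }
        },
    }
    if manifest_id:
        # "Search within": restrict hits to a single manifest.
        body["query"]["bool"]["filter"] = [{"term": {"manifestId": manifest_id}}]
    return body
```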
