gospel-search's Issues

Simplify Architecture

By refactoring to perform incremental updates instead of whole-system re-ingests, we can eliminate the need for an intermediate datastore (i.e. MongoDB) and ingest directly from the Gospel Library site into the search engine. Also, by switching from Elasticsearch to a vector database, we can eliminate the need for a re-ranking service. This would simplify the Data Transformation Pipeline to:

sequenceDiagram
    actor Operator
    Operator->>Worker: PUT /ingest
    Vector DB->>Worker: current state
    Note over Worker: determine which pages <br/> haven't been ingested yet
    Worker->>Gospel Library: requests
    Gospel Library->>Worker: web pages
    Note over Worker: extract segments and embed
    Worker->>Vector DB: new segments and embeddings
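The incremental step in the diagram above could be sketched roughly as follows. This is only an illustration under assumptions: `fetch`, `extract_segments`, `embed`, and `upsert` are hypothetical callables standing in for the HTTP client, the page parser, the embedding model, and the vector DB client, none of which are specified in this issue.

```python
from typing import Callable, Iterable


def pages_to_ingest(all_urls: Iterable[str], ingested_urls: Iterable[str]) -> list[str]:
    """Return the URLs that are not yet in the vector DB, keeping input order."""
    seen = set(ingested_urls)
    pending = []
    for url in all_urls:
        if url not in seen:
            seen.add(url)  # also drops duplicates within all_urls
            pending.append(url)
    return pending


def ingest(
    all_urls: Iterable[str],
    ingested_urls: Iterable[str],
    fetch: Callable[[str], str],                    # hypothetical: URL -> HTML
    extract_segments: Callable[[str], list[str]],   # hypothetical: HTML -> segments
    embed: Callable[[str], list[float]],            # hypothetical: segment -> vector
    upsert: Callable[[str, list[str], list[list[float]]], None],  # hypothetical DB write
) -> int:
    """Fetch, segment, embed, and store only the pages that are still missing."""
    count = 0
    for url in pages_to_ingest(all_urls, ingested_urls):
        segments = extract_segments(fetch(url))
        vectors = [embed(segment) for segment in segments]
        upsert(url, segments, vectors)
        count += 1
    return count
```

Because the set of already-ingested pages comes from the vector DB itself, no intermediate datastore is needed to track progress.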

And would simplify the front-end application stack to:

sequenceDiagram
    actor User
    User->>Proxy Server: GET /
    Proxy Server->>User: client app
    User->>Proxy Server: GET /api/search
    Proxy Server->>Vector DB: search query
    Vector DB->>Proxy Server: top-k segments
    Proxy Server->>User: search results
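As a rough sketch of the `GET /api/search` step above, the proxy only needs to parse the request, forward the query to the vector DB, and return the top-k segments. The query parameter names `q` and `k`, the JSON response shape, and the `vector_db_search` callable are all assumptions made for illustration; the issue does not define them.

```python
import json
from typing import Callable
from urllib.parse import parse_qs


def handle_search(
    query_string: str,
    vector_db_search: Callable[[str, int], list],  # hypothetical vector DB client call
    default_k: int = 10,
) -> tuple[int, str]:
    """Return (HTTP status, JSON body) for a /api/search request."""
    params = parse_qs(query_string)
    query = params.get("q", [""])[0].strip()
    if not query:
        return 400, json.dumps({"error": "missing query parameter 'q'"})
    try:
        k = int(params.get("k", [default_k])[0])
    except ValueError:
        return 400, json.dumps({"error": "parameter 'k' must be an integer"})
    # No re-ranking step: the vector DB result is returned as-is.
    results = vector_db_search(query, k)
    return 200, json.dumps({"query": query, "results": results})
```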

The chosen Vector DB solution must meet these requirements:

  • Supports keyword search
  • Supports semantic search via vector search
  • Supports metadata filtering (e.g. filter for only BoM segments)
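To make the three requirements concrete, here is a small in-memory sketch that combines them in one query. It is not any particular vector DB's API: the `Segment` fields, the `volume` metadata key, the term-overlap keyword score, and the `alpha` blending weight are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    text: str
    vector: list[float]
    volume: str  # hypothetical metadata field, e.g. "bom"


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear in the text (a crude stand-in for BM25)."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms) if terms else 0.0


def search(segments, query, query_vector, volume: Optional[str] = None, alpha=0.5, k=3):
    """Filter by metadata, then rank by a blend of semantic and keyword scores."""
    candidates = [s for s in segments if volume is None or s.volume == volume]
    scored = [
        (alpha * cosine(query_vector, s.vector) + (1 - alpha) * keyword_score(query, s.text), s)
        for s in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]
```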

Create a small evaluation dataset

(Note: I already have the evaluate sub-package in this repo as a starting point).

This lets us algorithmically compare the ranking performance of system variations, such as:

  • A different, more modern sentence embedding model, like all-mpnet-base-v2
  • Embedding the whole segment, rather than taking the mean of the sentence embeddings for the segment
  • Larger value of k
  • Using a nearest-neighbor (NN) lookup for the whole search solution, rather than re-ranking the top k results returned by Elasticsearch (which uses BM25)
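One way such a comparison could work is to score each variation against the evaluation dataset with standard ranking metrics. The sketch below uses mean reciprocal rank and recall@k; the dataset shape (query mapped to a set of relevant segment IDs) and the `system` callable are assumptions, not the layout of the existing evaluate sub-package.

```python
from typing import Callable


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant results that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def evaluate(
    system: Callable[[str], list[str]],  # hypothetical: query -> ranked segment IDs
    dataset: dict[str, set[str]],        # hypothetical: query -> relevant segment IDs
    k: int = 10,
) -> dict[str, float]:
    """Average both metrics over every query in the dataset."""
    rr, recall = [], []
    for query, relevant in dataset.items():
        ranked = system(query)
        rr.append(reciprocal_rank(ranked, relevant))
        recall.append(recall_at_k(ranked, relevant, k))
    n = len(dataset) or 1
    return {"mrr": sum(rr) / n, f"recall@{k}": sum(recall) / n}
```

Running `evaluate` once per variation (embedding model, pooling strategy, value of k, retrieval method) on the same dataset gives directly comparable numbers.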

If going the human evaluation route, one option is to collect pair-wise rankings and aggregate them into a full ranking with the Bradley-Terry model (source).
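As a minimal sketch of that aggregation, the function below fits Bradley-Terry strengths with the standard iterative (minorization-maximization) update and normalizes them to sum to 1. The input format, a list of `(winner, loser)` pairs, is an assumption; it also assumes every item wins at least one comparison, since otherwise its strength collapses to zero.

```python
def bradley_terry(items, comparisons, iterations=200):
    """Estimate a strength per item from (winner, loser) pairs, winner != loser."""
    wins = {item: 0 for item in items}
    pair_counts = {}  # unordered pair -> number of comparisons between the two
    for winner, loser in comparisons:
        wins[winner] += 1
        key = frozenset((winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n = pair_counts.get(frozenset((i, j)), 0)
                if n:
                    denom += n / (strength[i] + strength[j])
            # Items that were never compared keep their current strength.
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {item: value / total for item, value in updated.items()}
    return strength


def full_ranking(items, comparisons):
    """Order items from strongest to weakest."""
    strength = bradley_terry(items, comparisons)
    return sorted(items, key=lambda item: strength[item], reverse=True)
```

Here each item would be a candidate search result (or a system variation), and each pair records which of the two a human judge preferred.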
