Gospel Search

This project is split into two workflows: the data transformation pipeline and the online search engine user experience.

TL;DR

poe build
poe start:services
# Then in another terminal:
poe populate-chroma # only needed if there's new data to index.
open http://localhost:3000 # or `poe proxy` for serving prod

Data Transformation Pipeline

The data transformation pipeline can be started up via:

poetry run poe build
poetry run poe start:db

HTTP requests can then be made to the worker service at http://localhost:8080 to accomplish the steps outlined in the following sequence diagram:

sequenceDiagram
    actor Operator
    Operator->>Worker: PUT /crawl
    Worker->>Gospel Library: requests
    Gospel Library->>Worker: web pages
    Worker->>MongoDB: web pages
    Operator->>Worker: PUT /extract
    MongoDB->>Worker: web pages
    Worker->>MongoDB: extracted segments
    Operator->>Worker: PUT /embed
    MongoDB->>Worker: segments
    Worker->>MongoDB: embeddings
    Operator->>Worker: PUT /populate-es
    Worker->>ElasticSearch: segments

Note: The ElasticSearch index is currently not persisted across Docker image start-ups, so PUT /populate-es has to be called every time ElasticSearch starts up. That command takes only about 60 seconds to run, so it's not a big deal right now.
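
The sequence above can also be driven from a short script. The following is a minimal sketch, assuming the four endpoints accept a PUT with no request body and only respond once the step has finished; neither assumption is confirmed here. The names `WORKER_URL`, `PIPELINE_STEPS`, `build_step_request`, and `run_pipeline` are hypothetical and not part of the project.

```python
import urllib.request

# Worker address and step order taken from the sequence diagram above.
WORKER_URL = "http://localhost:8080"
PIPELINE_STEPS = ["crawl", "extract", "embed", "populate-es"]


def build_step_request(step, base_url=WORKER_URL):
    """Build a body-less PUT request for one pipeline step (assumed shape)."""
    return urllib.request.Request(f"{base_url}/{step}", method="PUT")


def run_pipeline(steps=PIPELINE_STEPS, base_url=WORKER_URL):
    """Send each step in order; urlopen raises on a 4xx/5xx response,
    so later steps are not attempted after a failure."""
    for step in steps:
        request = build_step_request(step, base_url)
        with urllib.request.urlopen(request) as response:
            print(f"PUT /{step} -> HTTP {response.status}")
```

With the worker running, `run_pipeline()` would trigger the steps in order, and `run_pipeline(["populate-es"])` would repeat only the re-indexing step needed after an ElasticSearch restart.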

Search Engine User Experience

Once the embeddings have been saved to the MongoDB instance and the segments have been loaded into the ElasticSearch instance, the search engine application stack can be started via:

poetry run poe build
poetry run poe start:services

The front-end UI can then be accessed via http://localhost:3000. User requests are handled using this workflow:

sequenceDiagram
    actor User
    User->>Proxy Server: GET /
    Proxy Server->>User: client app
    User->>Proxy Server: GET /api/search
    Proxy Server->>ElasticSearch: search query
    ElasticSearch->>Proxy Server: top-k segments
    Proxy Server->>NLP Service: top-k segments
    NLP Service->>Proxy Server: reranked top-k segments
    Proxy Server->>User: search results
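
The request flow above is a two-stage retrieve-then-rerank pattern: a fast first stage fetches the top-k candidates, and a second stage reorders them. The sketch below illustrates only that pattern with toy stand-ins. `two_stage_search`, `toy_retrieve`, `toy_score`, and `CORPUS` are hypothetical names, and the word-overlap scoring is a placeholder rather than what ElasticSearch or the NLP service actually compute.

```python
from typing import Callable, List

Segment = str


def two_stage_search(
    query: str,
    retrieve: Callable[[str, int], List[Segment]],
    score: Callable[[str, Segment], float],
    k: int = 10,
) -> List[Segment]:
    """Fetch up to k candidates, then reorder them by reranker score."""
    candidates = retrieve(query, k)
    return sorted(candidates, key=lambda seg: score(query, seg), reverse=True)


# Toy corpus standing in for the indexed segments.
CORPUS = [
    "faith is a principle of action",
    "charity never faileth",
    "faith hope and charity",
]


def toy_retrieve(query: str, k: int) -> List[Segment]:
    """Stand-in for the first stage: keep segments sharing any query word."""
    terms = set(query.lower().split())
    return [s for s in CORPUS if terms & set(s.split())][:k]


def toy_score(query: str, segment: Segment) -> float:
    """Stand-in for the reranker: fraction of segment words found in the query."""
    terms = set(query.lower().split())
    words = set(segment.split())
    return len(terms & words) / len(words)


results = two_stage_search("faith charity", toy_retrieve, toy_score, k=3)
```

Keeping the two stages behind plain callables mirrors the diagram: the first stage can stay cheap and recall-oriented, while the more expensive scoring only ever sees k segments.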

Overview of directory structure:

  • gospel_search/elasticsearch/: The code related to the ElasticSearch search engine server.
  • gospel_search/mongodb/: The code related to the MongoDB database which stores all the segments and embedding vectors.
  • gospel_search/ui/: The code for the proxy server and user interface.
  • gospel_search/web_scraping/: The code for the HTML scraper.
  • gospel_search/worker: The code for the worker server which runs all the ETL tasks.

Notes

The church has a new public API you can use, which looks like this:

curl 'https://www.churchofjesuschrist.org/study/api/v3/language-pages/type/content?lang=eng&uri=/general-conference/1971/04/life-is-eternal'

You can also fetch the index page for a session using e.g.:

curl 'https://www.churchofjesuschrist.org/study/api/v3/language-pages/type/content?lang=eng&uri=/general-conference/1971/04'

The `body` field of the response holds the page content as an HTML string, and there is also a `footnotes` property.
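
The same requests can be made from Python. This is a minimal sketch using only the standard library, assuming the endpoint returns JSON. The helper names `content_url` and `fetch_content` are hypothetical, and since the notes above do not say where `body` and `footnotes` sit within the response, the sketch returns the parsed document as-is rather than guessing at its nesting.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl examples above.
API_BASE = (
    "https://www.churchofjesuschrist.org"
    "/study/api/v3/language-pages/type/content"
)


def content_url(uri, lang="eng"):
    """Build the content URL for a page or session index URI."""
    # safe="/" keeps slashes unescaped, matching the curl examples.
    query = urllib.parse.urlencode({"lang": lang, "uri": uri}, safe="/")
    return f"{API_BASE}?{query}"


def fetch_content(uri, lang="eng"):
    """Fetch a page and return the parsed JSON (assumes a JSON response)."""
    with urllib.request.urlopen(content_url(uri, lang)) as response:
        return json.load(response)
```

For example, `fetch_content("/general-conference/1971/04/life-is-eternal")` would request the first URL shown above, and `fetch_content("/general-conference/1971/04")` the session index.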
