This project forked from e3-jsi/elens-miner-system

The microservice architecture for processing, analysing and searching through the environmental legal documents

Home Page: https://jozefstefaninstitute.github.io/eLENS-miner-system/

License: BSD 2-Clause "Simplified" License

eLENS Miner System

The eLENS miner system retrieves, processes and analyzes legal documents and maps them to specific geographical areas.

The system follows the microservice architecture and is written in Python 3. It consists of the following microservices:

  • Document Retrieval. The service responsible for providing documents based on the user's query. It leverages query expansion to improve the query results.

  • Document Similarity. This service calculates the semantic similarity of the documents and can provide a list of most similar documents to a user selected one. Here, we integrate state-of-the-art methods using word and document embeddings to capture the semantic meaning of the documents and use it to compare the documents.

  • Text Embeddings. The service is a collection of text embedding methods. For a given text it generates the text embedding which is then used in the previous microservices.

  • Entrypoint. This service is the interface and connects the previous microservices together. It is the entrypoint for the users to access the services.
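
As a rough illustration of how the Text Embeddings and Document Similarity services relate, the sketch below averages word vectors into a text embedding and ranks documents by cosine similarity. The toy vectors and the names `embed_text`, `cosine` and `most_similar` are hypothetical; the actual services may use different models and a different aggregation.

```python
import math

# Toy word vectors standing in for a word2vec or fastText model (assumption).
WORD_VECTORS = {
    "forest": [1.0, 0.0, 0.0],
    "logging": [0.8, 0.2, 0.0],
    "water": [0.0, 1.0, 0.0],
    "pollution": [0.0, 0.8, 0.2],
}


def embed_text(text):
    """Average the vectors of the known words; return None if none are known."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(doc_id, documents, k=5):
    """Return up to k (other_id, score) pairs, most similar first."""
    target = embed_text(documents[doc_id])
    if target is None:
        return []
    scores = []
    for other_id, text in documents.items():
        if other_id == doc_id:
            continue
        embedding = embed_text(text)
        if embedding is not None:
            scores.append((other_id, cosine(target, embedding)))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]
```

Calling `most_similar(1, {1: "forest logging", 2: "logging forest forest", 3: "water pollution"}, k=1)` ranks document 2 first, since its averaged vector points in nearly the same direction as that of document 1.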

Prerequisites

You can create a separate virtual environment for each microservice or a single one for all of them. We recommend using virtual environments whenever you develop several Python projects, because their dependencies can clash (suppose one project only supports numpy < 1.0 and another needs numpy == 1.5).

To create a virtual environment, navigate to the desired directory (usually the main folder of the project) and run

python -m venv venv

To activate the virtual environment on Windows, navigate into venv/Scripts and execute activate. On Linux or macOS, run source venv/bin/activate from the directory that contains venv. To deactivate a virtual environment, execute deactivate.

The virtual environment is active when (venv) appears at the start of the command prompt.
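
If the prompt prefix is hidden or customized, the interpreter itself can report whether it runs inside a virtual environment. The helper below is a minimal sketch (the name `in_virtualenv` is ours); it relies on `sys.prefix` differing from `sys.base_prefix` inside an environment created with `python -m venv`.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```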

Each microservice must be run separately. Each service can be used on its own, or you can use the entrypoint microservice, which connects all of the microservices together.

What follows is a short description of how to run each microservice. A more detailed description of each microservice can be found in its designated folder.

Text Embeddings Microservice

Currently, only one version of the text embedding service can be run and connected to the main component. Support for connecting more is planned for later.

  • Activate virtual environment if you wish to do so
  • Navigate into text_embeddings folder
  • Execute
    pip install -r requirements.txt
  • Run
    python -m nltk.downloader all
  • Place a copy of your word2vec or fasttext word embeddings in the data/embeddings folder
  • Navigate back to the base of the text_embeddings folder and run the service with
    # linux or mac
    python -m text_embedding.main start \
           -e production \
           -H localhost \
           -p 4001 \
           -mp (path to the model) \
           -ml (language of the model)
    
    # windows
    python -m text_embedding.main start -e production -H localhost -p 4001 -mp (path to the model) -ml (language of the model)

Document Retrieval Microservice

  • Activate virtual environment if you wish to do so
  • Navigate into document_retrieval folder
  • Execute
    pip install -r requirements.txt
  • Navigate into microservice/config folder
  • Create .env file and inside define the following variables:
    PROD_PG_DATABASE=
    PROD_PG_USERNAME=
    PROD_PG_PASSWORD=
    PROD_TEXT_EMBEDDING_HOST=
    PROD_TEXT_EMBEDDING_PORT=
    
    DEV_PG_DATABASE=
    DEV_PG_USERNAME=
    DEV_PG_PASSWORD=
    DEV_TEXT_EMBEDDING_HOST=
    DEV_TEXT_EMBEDDING_PORT=
  • Navigate to the base of document_retrieval folder and run the service with:
    # linux or mac
    python -m microservice.main start \
           -e production \
           -H localhost \
           -p 4100
    
    # windows
    python -m microservice.main start -e production -H localhost -p 4100

You can also run the service on a custom host and port by changing the -H and -p values.
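
The .env variables above come in PROD_ and DEV_ pairs, which suggests that the `-e` flag selects one set. The sketch below shows one way such a selection could work; `select_settings`, the mapping of environment names to prefixes, and the "development" name are assumptions, not the service's actual configuration code.

```python
import os

# Variable names without their prefix, taken from the .env listing above.
REQUIRED_KEYS = (
    "PG_DATABASE",
    "PG_USERNAME",
    "PG_PASSWORD",
    "TEXT_EMBEDDING_HOST",
    "TEXT_EMBEDDING_PORT",
)

# Assumed mapping from the -e value to the variable prefix.
PREFIXES = {"production": "PROD", "development": "DEV"}


def select_settings(environment, environ=None):
    """Collect the variables for one environment, failing on missing values."""
    environ = os.environ if environ is None else environ
    prefix = PREFIXES[environment]
    settings, missing = {}, []
    for key in REQUIRED_KEYS:
        name = f"{prefix}_{key}"
        value = environ.get(name)
        if value:
            settings[key] = value
        else:
            missing.append(name)
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    return settings
```

Failing early on an empty or missing variable makes a half-filled .env file visible at startup rather than at the first database query.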

Document Similarity Microservice

  • Activate virtual environment if you wish to do so
  • Navigate into document_similarity folder
  • Execute
    pip install -r requirements.txt
    
  • Navigate into microservice/config folder
  • Create a .env file with the following variables
    PROD_DATABASE_NAME =
    PROD_DATABASE_USER =
    PROD_DATABASE_PASSWORD =
    PROD_TEXT_EMBEDDING_URL =
    
    DEV_DATABASE_NAME =
    DEV_DATABASE_USER =
    DEV_DATABASE_PASSWORD =
    DEV_TEXT_EMBEDDING_URL =
    
  • Set the text embedding URL (PROD_TEXT_EMBEDDING_URL and DEV_TEXT_EMBEDDING_URL) to http://{HOST}:{PORT}/api/v1/embeddings/create where HOST and PORT are the values used to run the text embedding microservice
  • Navigate back into the base of the document_similarity folder and run the service with
    # linux or mac
    python -m microservice.main start \
           -e production \
           -H localhost \
           -p 4200
    
    # windows
    python -m microservice.main start -e production -H localhost -p 4200

You can also run the service on a custom host and port by changing the -H and -p values.
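
For example, the text embedding URL can be assembled from the host and port the text embedding microservice was started with. The helper name `text_embedding_url` is hypothetical; only the URL pattern comes from the instructions above.

```python
def text_embedding_url(host, port):
    """Build the URL expected in PROD_TEXT_EMBEDDING_URL / DEV_TEXT_EMBEDDING_URL."""
    return f"http://{host}:{port}/api/v1/embeddings/create"


if __name__ == "__main__":
    # With the text embedding service started on localhost:4001 as shown earlier.
    print("PROD_TEXT_EMBEDDING_URL =", text_embedding_url("localhost", 4001))
```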

Entrypoint

  • Activate virtual environment if you wish to do so
  • Navigate into entrypoint folder
  • Run
    pip install -r requirements.txt
    
  • Navigate into microservice/config folder
  • Create .env file with contents
    DEV_DATABASE_USER =
    DEV_DATABASE_HOST =
    DEV_DATABASE_PORT =
    DEV_DATABASE_PASSWORD =
    DEV_DATABASE_NAME =
    
    PROD_DATABASE_USER =
    PROD_DATABASE_HOST =
    PROD_DATABASE_PORT =
    PROD_DATABASE_PASSWORD =
    PROD_DATABASE_NAME =
    
  • Navigate back into entrypoint folder
  • Run the main service with
    # linux or mac
    python -m microservice.main start \
           -e production \
           -H localhost \
           -p 4500
    
    # windows
    python -m microservice.main start -e production -H localhost -p 4500
    If you routed the other microservices to different hosts or ports, provide these values as follows:
    # linux or mac
    python -m microservice.main start -H localhost -p 4500 \
      -teh {host of the text embedding microservice} \
      -tep {port of the text embedding microservice} \
      -drh {host of the document retrieval microservice} \
      -drp {port of the document retrieval microservice} \
      -dsh {host of the document similarity microservice} \
      -dsp {port of the document similarity microservice}
    
    # windows
    python -m microservice.main start -H localhost -p 4500 -teh {host of the text embedding microservice} -tep {port of the text embedding microservice} -drh {host of the document retrieval microservice} -drp {port of the document retrieval microservice} -dsh {host of the document similarity microservice} -dsp {port of the document similarity microservice}
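
Because the long command differs only in line continuation between platforms, it can be convenient to assemble it in Python. The sketch below builds the argument list from the flags shown above; the function name `entrypoint_command` and the `services` dictionary layout are hypothetical.

```python
import sys

# Host and port flags per microservice, as listed in the command above.
SERVICE_FLAGS = {
    "text_embedding": ("-teh", "-tep"),
    "document_retrieval": ("-drh", "-drp"),
    "document_similarity": ("-dsh", "-dsp"),
}


def entrypoint_command(host="localhost", port=4500, services=None):
    """Return the argv list that starts the entrypoint microservice.

    services maps a key of SERVICE_FLAGS to a (host, port) tuple;
    services left out keep whatever defaults the entrypoint uses.
    """
    command = [sys.executable, "-m", "microservice.main", "start",
               "-H", host, "-p", str(port)]
    for name, (host_flag, port_flag) in SERVICE_FLAGS.items():
        if services and name in services:
            service_host, service_port = services[name]
            command += [host_flag, service_host, port_flag, str(service_port)]
    return command
```

The resulting list can be passed to `subprocess.run(command, cwd="entrypoint", check=True)`, which behaves the same on Linux, macOS and Windows because no shell quoting is involved.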

Usage:

Available endpoints:

  • GET {HOST}:{PORT}/api/v1/documents/search (query parameters: query, m)

    • query -> your text query
    • m -> number of results

    Example request:

    {BASE_URL}/api/v1/documents/search?query=deforestation&m=10

    You will receive the top 10 documents most similar to the query "deforestation".

  • GET {HOST}:{PORT}/api/v1/documents/<document_id>/similar (query parameter: get_k)

    • document_id -> id of the document
    • get_k -> number of results

    Example request:

    {BASE_URL}/api/v1/documents/123/similar?get_k=5

    You will receive the 5 documents most similar to the document with id 123.

  • POST {HOST}:{PORT}/api/v1/documents/<document_id>/similarity_update

    • document_id -> id of the document

    Example request:

    {BASE_URL}/api/v1/documents/123/similarity_update

    Recalculates the similarities between the document with id 123 and the other documents.

  • GET {HOST}:{PORT}/api/v1/embeddings/create (query parameters: text, language)

    • text -> your text
    • language -> language of the text

    Example request:

    {BASE_URL}/api/v1/embeddings/create?text=ice%20cream&language=en

    You will receive the embedding of the text "ice cream" from the English word embedding model.

  • GET {HOST}:{PORT}/api/v1/documents (query parameter: document_ids)

    • document_ids -> comma-separated document ids

    Example request:

    {BASE_URL}/api/v1/documents?document_ids=1,3,17

    You will receive the document data for the documents with ids 1, 3 and 17.

  • GET {HOST}:{PORT}/api/v1/documents/<document_id>

    • document_id -> id of the document

    Example request:

    {BASE_URL}/api/v1/documents/3

    You will receive the document data for the document with id 3.
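
The request URLs for the endpoints above can be built with the standard library so that query values are escaped correctly. This is a minimal sketch: the function names are hypothetical, and BASE_URL is assumed to point at the entrypoint started on localhost:4500.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:4500"  # assumption: entrypoint host and port


def search_url(query, m=10, base_url=BASE_URL):
    """URL for searching documents by a text query."""
    return f"{base_url}/api/v1/documents/search?{urlencode({'query': query, 'm': m})}"


def similar_url(document_id, get_k=5, base_url=BASE_URL):
    """URL for the get_k documents most similar to document_id."""
    return f"{base_url}/api/v1/documents/{document_id}/similar?{urlencode({'get_k': get_k})}"


def embedding_request_url(text, language, base_url=BASE_URL):
    """URL for embedding a text; spaces are escaped as %20."""
    params = urlencode({"text": text, "language": language}, quote_via=quote)
    return f"{base_url}/api/v1/embeddings/create?{params}"


def documents_url(document_ids, base_url=BASE_URL):
    """URL for fetching several documents; commas are kept unescaped."""
    ids = ",".join(str(i) for i in document_ids)
    return f"{base_url}/api/v1/documents?{urlencode({'document_ids': ids}, safe=',')}"
```

A GET request can then be sent with `urllib.request.urlopen(search_url("deforestation"))`. The response format is not described here, so inspect the returned body before parsing it.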

Acknowledgments

This work is developed by AILab at the Jozef Stefan Institute.

The work is supported by the EnviroLENS project, a project that demonstrates and promotes the use of Earth observation as direct evidence for environmental law enforcement, including in a court of law and in related contractual negotiations.

Contributors

kraljsamo, eriknovak, zivaurbancic, sarabrezec, dependabot[bot]
