delftdata / valentine-system

Valentine scalable deployment for VLDB demo

License: Apache License 2.0

Dockerfile 0.42% HTML 0.34% JavaScript 28.86% CSS 3.25% Shell 0.30% Python 66.83%
data-discovery dataset-discovery schema-matching schema-mapping

Valentine in Action: Matching Tabular Data at Scale

This repository contains the system implementation of Valentine with the addition of a holistic schema matching element.

Install

Install using docker-compose

To install docker-compose, follow the official instructions. Then build the required containers (client, engine) by running:

docker-compose build

NOTE: For an easier Minio setup, create a folder named minio-volume in the project's root and add folders (buckets) containing your data there, so that they are loaded into the system right away.
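
The folder layout described in the note can also be prepared with a short script. This is only a minimal sketch, assuming each bucket is a plain subfolder of minio-volume. The helper name `prepare_minio_volume` is hypothetical and not part of the repository.

```python
from pathlib import Path


def prepare_minio_volume(project_root: Path, buckets: dict[str, dict[str, str]]) -> list[Path]:
    """Create minio-volume/<bucket>/<file> under project_root.

    `buckets` maps a bucket (folder) name to {file name: file content}.
    Returns the paths of the files that were written.
    """
    volume = project_root / "minio-volume"
    written = []
    for bucket, files in buckets.items():
        bucket_dir = volume / bucket
        # Each bucket is assumed to be an ordinary subfolder of minio-volume.
        bucket_dir.mkdir(parents=True, exist_ok=True)
        for name, content in files.items():
            path = bucket_dir / name
            path.write_text(content, encoding="utf-8")
            written.append(path)
    return written
```

For example, `prepare_minio_volume(Path("."), {"my-datasets": {"example.csv": "id,name\n1,alice\n"}})` would create minio-volume/my-datasets/example.csv in the current directory; the bucket and file names here are placeholders.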

Install in minikube

To install minikube, follow the official instructions.

NOTE: You will also need to enable the ingress addon by running: minikube addons enable ingress

First, you have two options: either pull the two required images (client, engine) built by us:

docker pull kpsarakis/schema-matching-engine:latest
docker pull kpsarakis/schema-matching-client:latest

or build them yourself by running:

docker build -t kpsarakis/schema-matching-engine:latest ./engine
docker build -t kpsarakis/schema-matching-client:latest --build-arg REACT_APP_SERVER_ADDRESS=/api ./client

Then install Helm by following the official instructions. Once it is installed, run the following commands to deploy a Redis cluster, a Minio cluster and RabbitMQ (their parameters can be tuned in the respective files in the helm-config folder):

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

helm install -f helm-config/minio-helm-values.yaml minio bitnami/minio
helm install -f helm-config/redis-helm-values.yaml redis bitnami/redis
helm install -f helm-config/rabbitmq-helm-values.yaml rabbitmq bitnami/rabbitmq

The final step is to deploy the client, server, celery worker, flower (celery cluster monitoring), and the ingress service. First, change the configurations of those deployments in the k8s folder (or keep the defaults), then run the following commands to apply them all:

cd k8s/
kubectl apply -f .

Install in a kubernetes cluster

Assuming that you have created a kubernetes cluster somewhere and have the kubectl command configured, go to the helm-config and k8s folders and change the deployment configurations to match your use case. Then run:

./deploy-charts.sh
cd k8s/
kubectl apply -f .

This will create all of the system's components within the cluster and also add an nginx load balancer to handle incoming traffic.

Run

Run with docker-compose

To run the system with docker compose:

docker-compose up

then go to:

  • localhost:3000 to access the UI.
  • localhost:5555 to access the celery cluster monitoring tool Flower.
  • localhost:5000 to access the system's API.
  • localhost:9000 to access Minio.
  • localhost:15672 to access RabbitMQ.
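
Whether the services listed above are reachable after docker-compose up can be checked with a small script. This is a minimal sketch, assuming the default ports from the list and a plain TCP connection test; the names `SERVICES`, `is_port_open` and `check_services` are made up for this example.

```python
import socket

# Ports taken from the list above; the host is assumed to be localhost.
SERVICES = {
    "ui": 3000,
    "flower": 5555,
    "api": 5000,
    "minio": 9000,
    "rabbitmq": 15672,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_services(host: str = "localhost") -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}
```

An open port only shows that something is listening there, not that the service behind it is healthy.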

Run with minikube

To access the system deployed with Minikube, first get its IP by running:

minikube ip

Then you can access the UI at that address, and the API at the same address with the /api suffix.
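
The two addresses can be derived from the output of minikube ip as follows. This is a minimal sketch: the `http` scheme is an assumption (the ingress configuration decides what is actually served), and both function names are hypothetical.

```python
import subprocess


def get_minikube_ip() -> str:
    """Run `minikube ip` and return its output without the trailing newline."""
    result = subprocess.run(
        ["minikube", "ip"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def minikube_urls(ip: str, scheme: str = "http") -> dict[str, str]:
    """Build the UI and API addresses from the Minikube IP."""
    base = f"{scheme}://{ip.strip()}"
    # The API is the same address with the /api suffix.
    return {"ui": base, "api": f"{base}/api"}
```

Calling `minikube_urls(get_minikube_ip())` requires minikube to be installed and running; `minikube_urls` on its own works with any IP string.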

If you want to access a specific service run the following command:

minikube service <SERVICE_NAME>

e.g. for the flower service:

minikube service flower-service

NOTE: The names can be found by running kubectl get svc

NOTE: To access the services deployed by Helm use the instructions given after their deployment.

Run with a kubernetes cluster

The cluster case is similar to Minikube: get the external IP of the nginx load balancer instead of Minikube's, and access the UI and the API in the same way. For the rest of the services, follow either the Helm instructions or your provider's port-forwarding instructions.

Repo structure

  • client Module containing the React implementation of the system's UI.

  • engine Module containing the schema matching engine and the backend of the system.

  • env_files Folder containing example env files for the docker-compose.

  • helm-config Folder containing the configuration of the redis, rabbitmq and ingress-nginx charts.

  • k8s Folder containing the kubernetes deployments, apps and services for the client, server, celery worker, flower (celery cluster monitoring), and the ingress service.

valentine-system's Issues

Create evaluation boxplots

The code placeholder is added in the following task and endpoint:

@celery.task
def generate_boxplot_celery(results: dict):
    # results maps each algorithm name to the paths of its result files
    for algorithm_name, result_paths in results.items():
        for result_path in result_paths:
            # This contains a single json file information
            evaluation_result: dict = get_dict_from_minio_json_file(minio_client, 'valentine-results', result_path)


@app.post('/valentine/generate_boxplot/<job_id>')
def valentine_generate_boxplot(job_id: str):
    folder_contents = list_bucket_files('valentine-results', minio_client)
    results = folder_contents[job_id]
    generate_boxplot_celery.s(results).apply_async()
    return Response('Success', status=200)

For easier testing, create a folder named minio-volume in the project root and, inside it, a subfolder named valentine-results. There you can add a folder, named after a job id, that contains the results of a benchmarking job.
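
One possible way to turn the loaded results into boxplot data is to collect a single metric per algorithm and compute its five-number summary. The sketch below makes several assumptions: the evaluation results have already been loaded into dicts, each dict holds a numeric value under a metric key chosen by the caller, and the helper names `five_number_summary` and `boxplot_data` are hypothetical. The actual layout of the result files is not described here.

```python
from statistics import quantiles


def five_number_summary(values: list[float]) -> dict[str, float]:
    """Min, quartiles and max: the numbers a boxplot draws."""
    ordered = sorted(values)
    if len(ordered) == 1:
        # A single value has no spread; all quartiles collapse onto it.
        q1 = median = q3 = ordered[0]
    else:
        q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": median, "q3": q3, "max": ordered[-1]}


def boxplot_data(results: dict[str, list[dict]], metric: str) -> dict[str, dict[str, float]]:
    """Map each algorithm name to the summary of `metric` over its results.

    Algorithms without any value for `metric` are left out.
    """
    summary = {}
    for algorithm_name, evaluations in results.items():
        values = [float(e[metric]) for e in evaluations if metric in e]
        if values:
            summary[algorithm_name] = five_number_summary(values)
    return summary
```

The resulting dict can be handed to any plotting library, or returned to the client as JSON so that the UI draws the boxes.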

Add support for postgres

  • Create data source abstractions
  • Integrate Postgres as a data source
  • Allow for jobs that combine data from different data sources
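
A data source abstraction along these lines could look as follows. This is a sketch under stated assumptions, not a design taken from the repository: the `DataSource` interface, its two methods, and the in-memory CSV source are all hypothetical. A Postgres-backed source would implement the same two methods on top of a database driver.

```python
import csv
import io
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical common interface for anything that can provide tables."""

    @abstractmethod
    def list_tables(self) -> list[str]:
        """Return the names of the tables this source offers."""

    @abstractmethod
    def read_table(self, name: str) -> list[dict[str, str]]:
        """Return the rows of one table as dicts keyed by column name."""


class InMemoryCsvSource(DataSource):
    """Stand-in for a file-based source such as CSV objects in a bucket."""

    def __init__(self, files: dict[str, str]):
        self._files = files  # table name -> CSV text

    def list_tables(self) -> list[str]:
        return sorted(self._files)

    def read_table(self, name: str) -> list[dict[str, str]]:
        return list(csv.DictReader(io.StringIO(self._files[name])))


def collect_columns(sources: dict[str, DataSource]) -> dict[str, list[str]]:
    """Map '<source>/<table>' to its column names across all sources."""
    columns = {}
    for source_name, source in sources.items():
        for table in source.list_tables():
            rows = source.read_table(table)
            columns[f"{source_name}/{table}"] = list(rows[0].keys()) if rows else []
    return columns
```

Because a job only sees the `DataSource` interface, tables from different sources can be combined in the same job without the matching code knowing where they came from.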

Integrate the data fabrication module of Valentine

Add the logic to the following (template) Celery task in the Flask application:

@celery.task
def create_fabricated_data(file_name: str,
                           dataset_group_name: str,
                           fabrication_variants: tuple[bool, bool, bool, bool],
                           fabrication_parameters: tuple[list[bool], list[bool], list[bool], list[bool]],
                           fabrication_pairs: tuple[int, int, int, int]):
    df = get_pandas_df_from_minio_csv_file(minio_client, 'tmp-folder', file_name)  # the loaded csv file
    fbr_joinable, fbr_unionable, fbr_view_unionable, fbr_semantically_joinable = fabrication_variants
    joinable_specs, unionable_specs, view_unionable_specs, semantically_joinable_specs = fabrication_parameters
    joinable_pairs, unionable_pairs, view_unionable_pairs, semantically_joinable_pairs = fabrication_pairs
    bucket_name = "FabricatedData"
    if fbr_joinable:
        app.logger.info(f"Fabricating Joinable data for: {file_name}")
        # bool array in the format noisy instances, noisy schemata, verbatim instances and verbatim schemata
        what_to_fabricate: list[bool] = joinable_specs
        pairs: int = joinable_pairs
        # example of storing data to minio
        # filename = ...
        # file = ...
        # minio_client.fput_object(bucket_name, filename, file)
    if fbr_unionable:
        app.logger.info(f"Fabricating Unionable data for: {file_name}")
        # bool array in the format noisy instances, noisy schemata, verbatim instances and verbatim schemata
        what_to_fabricate: list[bool] = unionable_specs
        pairs: int = unionable_pairs
    if fbr_view_unionable:
        app.logger.info(f"Fabricating View Unionable data for: {file_name}")
        # bool array in the format noisy instances, noisy schemata, verbatim instances and verbatim schemata
        what_to_fabricate: list[bool] = view_unionable_specs
        pairs: int = view_unionable_pairs
    if fbr_semantically_joinable:
        app.logger.info(f"Fabricating Semantically Joinable data for: {file_name}")
        # bool array in the format noisy instances, noisy schemata, verbatim instances and verbatim schemata
        what_to_fabricate: list[bool] = semantically_joinable_specs
        pairs: int = semantically_joinable_pairs
    minio_client.remove_object('tmp-folder', file_name)
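
To illustrate what one of the branches in the template might produce, the sketch below splits one table into two tables that share some columns, which is one possible reading of a "joinable" pair. It is not the Valentine fabricator: the function name `split_joinable`, the `overlap_columns` parameter, and the representation of a table as a list of row dicts are all assumptions made for this example.

```python
def split_joinable(rows: list[dict], overlap_columns: int) -> tuple[list[dict], list[dict]]:
    """Split a table vertically into two tables sharing `overlap_columns` columns.

    The first `overlap_columns` columns appear in both outputs; the
    remaining columns are divided between the two outputs.
    """
    if not rows:
        return [], []
    columns = list(rows[0].keys())
    if not 0 < overlap_columns <= len(columns):
        raise ValueError("overlap_columns must be between 1 and the number of columns")
    shared = columns[:overlap_columns]
    rest = columns[overlap_columns:]
    half = len(rest) // 2
    left_columns = shared + rest[:half]
    right_columns = shared + rest[half:]
    # Project every row onto the two column sets.
    left = [{c: row[c] for c in left_columns} for row in rows]
    right = [{c: row[c] for c in right_columns} for row in rows]
    return left, right
```

In the template, the rows could come from the loaded dataframe, and each resulting pair would then be written to the target bucket in the way the commented-out storage example suggests.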
