aws-samples / tabular-column-semantic-search

Code accompanying AWS blog post "Build a Semantic Search Engine for Tabular Columns with Transformers and Amazon OpenSearch Service"

License: Apache License 2.0

Languages: Python 98.51%, Dockerfile 1.49%
Topics: approximate-nearest-neighbor-search, aws, aws-cdk, aws-lambda, columns, faiss, nlp, opensearch, sentence-transformers, streamlit-webapp

Tabular Column Semantic Search

This app creates the following:

  1. An automated pipeline for embedding column data from CSVs and indexing the embeddings to OpenSearch.
  2. A web app enabling users to search for the approximate nearest neighbors to a provided input.
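To make "nearest neighbors" concrete, here is a minimal sketch of the underlying idea: rank stored column embeddings by cosine similarity to a query embedding. It is an exact brute-force scan in plain Python, used only as a stand-in for the approximate search that the app runs against OpenSearch. The column names and 3-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def nearest_columns(query, index, k=2):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings, not produced by any real model.
toy_index = {
    "customer_email": [0.9, 0.1, 0.0],
    "order_total": [0.0, 0.2, 0.9],
    "contact_address": [0.8, 0.3, 0.1],
}

top = nearest_columns([1.0, 0.2, 0.0], toy_index, k=2)
```

An approximate index trades a small amount of recall for much faster lookups than this linear scan, which matters once the number of indexed columns grows large.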

Services used:

Embeddings are created using SentenceTransformers. By default the following models are used:

  • all-MiniLM-L6-v2
  • all-distilroberta-v1
  • average_word_embeddings_glove.6B.300d
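As a rough illustration of how column data could be embedded with one of these models, the sketch below serializes each column to a short string and passes the strings to SentenceTransformers. The serialization in `column_to_text` (header plus a few sample values) is an assumption made for this example; this README does not describe how the pipeline actually turns a column into text. The `sentence-transformers` package must be installed for `embed_columns` to run.

```python
# Model names as listed above.
DEFAULT_MODELS = [
    "all-MiniLM-L6-v2",
    "all-distilroberta-v1",
    "average_word_embeddings_glove.6B.300d",
]


def column_to_text(header, values, max_values=5):
    """Hypothetical serialization: the header followed by a few sample values."""
    sample = [str(v) for v in values[:max_values]]
    return f"{header}: " + ", ".join(sample)


def embed_columns(columns, model_name=DEFAULT_MODELS[0]):
    """Embed a {header: values} mapping and return {header: vector}."""
    # Imported lazily so column_to_text works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    texts = [column_to_text(header, values) for header, values in columns.items()]
    vectors = model.encode(texts)  # one vector per input string
    return dict(zip(columns.keys(), vectors))
```

For example, `embed_columns({"city": ["Paris", "Rome"]})` would download the model on first use and return one vector keyed by `"city"`.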

Architecture

Deployment prerequisites

  1. AWS CDK
  2. Docker running in the background
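A quick way to confirm both prerequisites before deploying is sketched below. It assumes the executables are named `cdk` and `docker` on your PATH, and it treats a successful `docker info` call as evidence that the Docker daemon is reachable. The helper names are hypothetical and not part of this repository.

```python
import shutil
import subprocess


def missing_tools(tools=("cdk", "docker")):
    """Return the tools from the list that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def docker_daemon_running(timeout=10):
    """True if `docker info` exits successfully, i.e. the daemon answered."""
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Docker is not installed, not executable, or did not answer in time.
        return False
    return result.returncode == 0
```

Calling `missing_tools()` returns an empty list when both tools are installed, and `docker_daemon_running()` tells you whether Docker is actually running rather than merely installed.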

How do I use this pipeline and web app?

  1. Customize email, username, and any other desired configs in config.yaml.
  2. Deploy resources by following the steps under "Steps to deploy" below. Recommended: Deploy CDK from a cloud-based instance such as EC2 or Cloud9.
  3. Once deployed, upload CSV files with column headings to the data/csv/input/file or data/csv/input/batch path of the S3 bucket created during deployment. Files uploaded to data/csv/input/file are processed individually and automatically upon upload. Files uploaded to data/csv/input/batch are processed together when the pipeline is manually triggered. During pipeline execution, input data is automatically embedded and indexed to OpenSearch. After successful indexing, input data is moved to data/csv/processed/. You can track the pipeline status in the Step Functions state machine console.
    • To upload batch CSV files, run the script run_pipeline.py from the command line. By default, the script uploads the sample batch datasets listed in sample-batch-datasets.json to the S3 bucket (<DESTINATION_BUCKET>/data/csv/input/batch) and invokes the Lambda function that starts the pipeline:
          python tools/run_pipeline.py --destination_bucket <DESTINATION_BUCKET> --input_mode batch --batch_datasets_file sample-batch-datasets.json
      
    • To upload a single CSV file, run the same script run_pipeline.py with the following options:
          python tools/run_pipeline.py --destination_bucket <DESTINATION_BUCKET> --input_mode file --file_or_url <LOCAL_OR_REMOTE_CSV_PATH>
      
  4. After deployment, you will receive sign-in credentials and the web app URL via email, at the address you specified in config.yaml. Log in to the web app using these credentials. You will be prompted to reset your password during the first login.
    • Note: the demo creates and uses a self-signed certificate for the web app, which may not be trusted by your web browser by default. Self-signed certificates should not be used beyond testing. For best security, use a certificate signed by a credible CA.
  5. Use the web app to query OpenSearch and explore results.
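The two input paths described in step 3 can be sketched as a small helper that maps a local CSV file to its S3 key. This is an illustration, not the logic of run_pipeline.py: the function names are hypothetical, and the upload assumes `boto3` is installed with AWS credentials configured. Unlike run_pipeline.py, it does not invoke the Lambda function, so files sent to the batch path stay unprocessed until the pipeline is triggered.

```python
import os

# Prefixes taken from step 3 above.
INPUT_PREFIXES = {
    "file": "data/csv/input/file",
    "batch": "data/csv/input/batch",
}


def input_key(local_path, input_mode="file"):
    """Build the S3 object key for a CSV under the chosen input mode."""
    if input_mode not in INPUT_PREFIXES:
        raise ValueError(f"input_mode must be one of {sorted(INPUT_PREFIXES)}")
    if not local_path.lower().endswith(".csv"):
        raise ValueError("expected a .csv file")
    return f"{INPUT_PREFIXES[input_mode]}/{os.path.basename(local_path)}"


def upload_csv(local_path, bucket, input_mode="file"):
    """Upload a local CSV to the pipeline's input path and return its key."""
    import boto3  # imported lazily; only needed for the actual upload

    key = input_key(local_path, input_mode)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key
```

For example, `upload_csv("sales.csv", "<DESTINATION_BUCKET>")` would place the file under data/csv/input/file, where it is processed automatically upon upload.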

Steps to deploy

Create a virtual environment:

$ python3 -m venv .venv

Activate your virtualenv:

$ source .venv/bin/activate

Install the required dependencies:

$ pip install --upgrade pip
$ pip install -r requirements.txt

At this point you can synthesize the CloudFormation template for this code:

$ cdk synth

Bootstrap your default AWS account/region. Note you may incur AWS charges for data stored in the bootstrapped resources.

$ cdk bootstrap

Deploy the pipeline to your default AWS account/region. Note Docker needs to be running in the background. During deployment, you will be prompted to confirm deployment of each stack. Resources will incur charges in your account while deployed.

$ cdk deploy --all

To tear down the pipeline, run the following aptly named command. You will be prompted to confirm deletion.

$ cdk destroy --all
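If you prefer to script the synth, bootstrap, and deploy commands rather than type them one by one, a minimal sketch follows. It assumes the virtual environment is already activated and dependencies are installed. The `run_steps` helper and its `dry_run` flag are hypothetical conveniences, not part of this repository. With `dry_run=True` (the default here) nothing is executed, so you can review the commands first.

```python
import subprocess

# Commands from the steps above, in order.
DEPLOY_STEPS = [
    ["cdk", "synth"],
    ["cdk", "bootstrap"],
    ["cdk", "deploy", "--all"],
]


def run_steps(steps=DEPLOY_STEPS, dry_run=True):
    """Run each command in order, stopping at the first failure.

    Returns the commands as strings so the plan can be inspected or logged.
    """
    planned = []
    for cmd in steps:
        planned.append(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
    return planned
```

Because `cdk deploy --all` prompts for confirmation of each stack, run the script from an interactive terminal when `dry_run=False`.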

Useful commands

  • cdk ls: list all stacks in the app
  • cdk synth: emit the synthesized CloudFormation template
  • cdk deploy: deploy this stack to your default AWS account/region
  • cdk diff: compare the deployed stack with the current state
  • cdk docs: open the CDK documentation
  • cdk destroy: destroy the existing stack
