Coder Social home page Coder Social logo

aws-samples / query-databases-with-natural-language Goto Github PK

View Code? Open in Web Editor NEW
7.0 3.0 1.0 688 KB

This project uses Mixtral 8x7B Instruct, deployed in Amazon SageMaker, to enable users to chat with their database using natural language, without writing any code or SQL query.

License: MIT No Attribution

Jupyter Notebook 100.00%
aws database generative-ai mixtral python sagemaker text2sql redshift

query-databases-with-natural-language's Introduction

Transforming the Way We Talk to Databases: Using Everyday Language to Search and Retrieve Data with Mixtral 8x7B

This project enables interacting with relational databases through Natural Language. Specifically, it uses the sparse Mixture of Experts (MoE) model Mixtral 8x7B Instruct, for generating SQL queries, and interpreting their corresponding tabular results to return to the user as answers. The application leverages SageMaker hosting tools to serve the model, which is deployed with a few clicks from SageMaker JumpStart. For more details, please refer to this blogpost.

Setup Requirements

You can currently deploy Mixtral 8x7B on SageMaker Jumpstart with one click. Amazon SageMaker JumpStart provides a simplified way to access and deploy over 100 different open source and third-party foundation models. In order to launch an endpoint to host Mixtral 8x7B from SageMaker JumpStart, you may need to request a service quota increase to access an ml.g5.48xlarge instance for endpoint usage. You can easily request service quota increases through the AWS console, CLI, or API to get access.

You will also need access to a relational data source. Amazon Redshift is used as the primary data source in this post with the TICKIT database. This database helps analysts track sales activity for the fictional TICKIT web site, where users buy and sell tickets online for sporting events, shows, and concerts.

Please review any license terms applicable to the dataset with your legal team and confirm that your use case complies with the terms before proceeding.

You will first need to set up a Redshift cluster if you don't already have one. Use the Amazon Redshift console or CLI to launch a cluster with your desired node type and number of nodes. Make sure to note the cluster endpoint, database name, and credentials to connect.

Once the cluster is available, create a new database and tables in it to hold the relational data. You can load data for the TICKIT database from S3 following these steps.

To test that you successfully added data to your Redshift cluster. Follow these steps:

  1. On the Redshift console, choose Clusters and select the cluster to query.
  2. Click on the Query Editor tab to open the query editor.
  3. You can run the following sample queries, or write your own:
/* Find total sales on a given date. */
SELECT sum(qtysold)
FROM sales, date
WHERE sales.dateid = date.dateid AND caldate = '2008-01-05';
/* Find the top 10 buyers. */
SELECT firstname, lastname, total_quantity
FROM (SELECT buyerid, sum(qtysold) total_quantity
      FROM sales GROUP BY buyerid
      ORDER BY total_quantity
      desc limit 10) Q, users
WHERE Q.buyerid = users.userid
ORDER BY Q.total_quantity desc;

If you get successful responses, it means that you have correctly loaded the database data onto the cluster. The query editor allows saving, scheduling and sharing queries. You can also view query plans, execution details and monitor query performance.

We recommend running this notebook in Amazon SageMaker Studio. For that, you must first set up a SageMaker domain, making sure it has the appropriate permissions to interact with Amazon Redshift. Then, clone this GitHub repository into SageMaker Studio Classic with the following command:

git clone https://github.com/aws-samples/query-databases-with-natural-language.git

Open the query-amazon-redshift-with-mixtral-8x7B-instruct.ipynb notebook to run through it.

Solution Architecture

At a high level, Text2SQL solutions such as the one in this repository, consist of three core components:

  1. Structured Data Source: This can be any relational data source such as Amazon RDS, Amazon Aurora, AWS Athena, or Snowflake. It contains the business data to query.

  2. Foundation Model: A large language model (LLM) that is able to understand the data schema of the source database and map natural language questions into corresponding SQL queries.

  3. Orchestrator Back-end: The code scripts can be executed in environments such as a SageMaker Studio notebook, a Lambda function, EC2, or ECS. On top of that, you could optionally add an orchestration service, such as AWS Step Functions, if needed.

In the code provided in this repository, the architecture is the following:

The end-to-end flow is as follows:

  1. The user asks a natural language question which is passed to the Mixtral 8x7B Instruct model, hosted in SageMaker.

  2. The LLM analyzes the question and uses the schema fetched from the connected Redshift database to generate a SQL query.

  3. The SQL query is run against the database. In case of an error, a retry workflow is executed.

  4. Tabular results received are passed back to the LLM for interpretation and to convert them into a natural language response to the user's original question.

For a step-by-step walk-through of the implementation, please check out the reference blogpost.

query-databases-with-natural-language's People

Contributors

joseanavarrom avatar amazon-auto avatar

Stargazers

Casper avatar Timothy Spann avatar नवुले पवन कुमार राव avatar  avatar  avatar Kryspin Ziemski avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

sandy4321

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.