sql-machine-learning / sqlflow


Brings SQL and AI together.

Home Page: https://sqlflow.org

License: Apache License 2.0

Dockerfile 0.65% Go 50.02% Yacc 0.77% Shell 4.12% Python 40.59% HTML 0.28% Java 2.46% JavaScript 1.10%
sqlflow sql-syntax ai transpiler deep-learning databases machine-learning

sqlflow's Introduction

SQLFlow


What is SQLFlow

SQLFlow is a compiler that compiles a SQL program to a workflow that runs on Kubernetes. The input is a SQL program written in our extended SQL grammar to support AI jobs, including training, prediction, model evaluation, model explanation, custom jobs, and mathematical programming. The output is an Argo workflow that runs on a Kubernetes cluster in a distributed fashion.

SQLFlow supports various database systems such as MySQL, MariaDB, TiDB, Hive, and MaxCompute, and many machine learning toolkits such as TensorFlow, Keras, and XGBoost.

Try SQLFlow NOW in our playground https://playground.sqlflow.tech/ and check out the handy tutorials in it.

Motivation

The current experience of developing ML-based applications requires a team of data engineers, data scientists, and business analysts, as well as a proliferation of advanced languages and programming tools like Python, SQL, SAS, SASS, Julia, and R. The fragmentation of tooling and development environments adds engineering difficulty to model training and tuning. What if we married the most widely used data management/processing language, SQL, with ML/system capabilities and let engineers with SQL skills develop advanced ML-based applications?

There is already some work in progress in the industry. We can write simple machine learning prediction (or scoring) algorithms in SQL using operators like DOT_PRODUCT. However, this requires copy-and-pasting model parameters from the training program into SQL statements. In the commercial world, we see some proprietary SQL engines providing extensions to support machine learning capabilities.

  • Microsoft SQL Server: Microsoft SQL Server has the machine learning service that runs machine learning programs in R or Python as an external script.
  • Teradata SQL for DL: Teradata also provides a RESTful service, which is callable from the extended SQL SELECT syntax.
  • Google BigQuery: Google BigQuery enables machine learning in SQL by introducing the CREATE MODEL statement.

None of the existing solutions addresses our pain points; instead, we want a solution that is fully extensible:

  1. The solution should be compatible with many SQL engines, instead of a specific version or type.
  2. It should support sophisticated machine learning models, including TensorFlow for deep learning and XGBoost for trees.
  3. We also want the flexibility to configure and run cutting-edge ML algorithms, including specifying feature crosses; at the very least, there should be no Python or R code embedded in the SQL statements, and the solution should be fully integrated with hyperparameter estimation.

Quick Overview

Here are examples of training a TensorFlow DNNClassifier model on the sample data iris.train and running prediction with the trained model. You can see how cool it is to write elegant ML code using SQL:

sqlflow> SELECT *
FROM iris.train
TO TRAIN DNNClassifier
WITH model.n_classes = 3, model.hidden_units = [10, 20]
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
INTO sqlflow_models.my_dnn_model;

...
Training set accuracy: 0.96721
Done training
sqlflow> SELECT *
FROM iris.test
TO PREDICT iris.predict.class
USING sqlflow_models.my_dnn_model;

...
Done predicting. Predict table : iris.predict

How to use SQLFlow

Contributing Guidelines

Roadmap

SQLFlow would love to support as many mainstream ML frameworks and data sources as possible, but we feel this expansion would be hard to do merely on our own, so we would love to hear your opinions on which ML frameworks and data sources you are currently using and building upon. Please refer to our roadmap for specific timelines, and also let us know your current scenarios and interests around the SQLFlow project so we can prioritize based on feedback from the community.

Feedback

Your feedback is our motivation to move on. Please let us know your questions, concerns, and issues by filing GitHub Issues.

License

Apache License 2.0


sqlflow's People

Contributors

brightcoder01, dependabot[bot], derek-wds, echo9573, egolearner, ggaaooppeenngg, huangxinv587, hungry1526, joyyoj, lhw362950217, linkerzhang, matrix4284, merlintang, publicwlj, romainr, samplise, shendiaomo, sneaxiy, sperlingxx, terrytangyuan, tonyyang-svail, typhoonzero, uuleon, wangkuiyi, weiguoz, yake0176, yancey1989, yuyicg, ywskycn, zou000


sqlflow's Issues

During inference, how to get model config

In the training phase, the SQL statement contains the model config: DNNClassifier, n_classes, and hidden_units.

SELECT sepal_length, sepal_width, petal_length, petal_width, species
FROM irisis
TRAIN DNNClassifier
WITH
  n_classes = 3,
  hidden_units = [10, 20]
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL species
INTO my_dnn_model;

However, the infer statement doesn't have it

SELECT sepal_length, sepal_width, petal_length, petal_width, species
FROM irisis
INFER my_dnn_model;

Avoid global variable parsedResult

The current parse API uses a global variable parseResult. Is it possible to change it to

parseResult := sqlParse(newLexer("select * ..."))

In the inference phase, we parse two SQL statements: trainSQL and inferSQL. I feel it would be good practice to avoid sharing parseResult.
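
A minimal sketch of what a global-free API could look like, assuming the grammar's top-level action can write its result into the lexer instead of a package-level variable; the field and type names below are illustrative, not the current code:

package sql

import "fmt"

type lexer struct {
	// ... existing scanner state ...
	result *extendedSelect // filled in by the grammar's top-level reduce action
}

func sqlParse(l *lexer) (*extendedSelect, error) {
	// yyParse is the goyacc-generated entry point; a non-zero return means a parse error.
	if ret := yyParse(l); ret != 0 {
		return nil, fmt.Errorf("sqlParse: yyParse returned %d", ret)
	}
	return l.result, nil
}

// Usage in the inference path: two independent results, nothing shared.
// trainParsed, err := sqlParse(newLexer(trainSQL))
// inferParsed, err := sqlParse(newLexer(inferSQL))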

add train phase in executor.go

An executor will execute a training job. It needs to do the following

  1. Before training, generate the training job code using generateTFProgram in codegen.go
  2. Train the model in a Docker container, with the working directory mounted via the -v option (see the sketch after this list)
  3. After training, upload the trained model file to the MySQL database
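
A minimal sketch of step 2, assuming the generated program has been written to train.py in the working directory; the image tag and file name are illustrative assumptions:

package sql

import (
	"fmt"
	"os/exec"
)

// trainInContainer runs the generated training program inside a TensorFlow
// container, mounting the working directory so the model is written back to cwd.
func trainInContainer(cwd string) error {
	cmd := exec.Command("docker", "run", "--rm",
		"-v", fmt.Sprintf("%s:/work", cwd),
		"-w", "/work",
		"tensorflow/tensorflow:1.12", "python", "train.py")
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("training failed: %v\n%s", err, out)
	}
	return nil
}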

Add sql/executor.go

An executor will execute the training/evaluation job. It needs to do the following

  1. Download the model files from the MySQL database, if necessary, to the working directory
  2. Copy the generated Python training file to the working directory
  3. Train the model in a Docker container, with the working directory mounted via the -v option
  4. After the job finishes, upload the trained model file to the MySQL database, if necessary

The remaining work in sqlflow/sql

  • set up the Iris dataset as a MySQL Docker image in example/iris
  • test auto-generated train program
  • test auto-generated evaluate program
  • test auto-generated predict program
  • write SQLFlowRunner and SQLFlowRunnerHandler, whose input is a SQL statement
    1. Parse and translate the SQL statement into a TensorFlow Python training/predicting program.
    2. Call one of the following
      1. Train,
      2. Evaluate, or,
      3. Predict.
  • Jupyter Notebook magic command
  • Set up for single-node regression test.

Missing quotation marks on "columns" and "save" contents

When serializing the parsed SQL statement to JSON, a couple of quotation marks are missing:

{
  "extended": true,
  "train": true,
  "standardSelect": "SELECT employee.age, last_name, salary\n FROMemployee\n WHERE employee.age % 10 < (salary / 10000) AND strings.Upper(last_name) = \"WANG\"\n LIMIT 100;",
  "trainClause": {
    "estimator": "DNNClassifier",
    "attrs": {
      "hidden_units": "[10, 20]",
      "n_classes": "3"
    },
    "columns": [
      employee.name,
      bucketize(last_name, 1000),
      cross(embedding(emplyoee.name), bucketize(last_name, 1000))
    ],
    "save": my_dnn_model
  }
}

make SQL parser support select *

const (
    simpleStarSelect = `
SELECT *
FROM irisis;
`
)

func TestSimpleSelect(t *testing.T) {
	assert := assert.New(t)
	assert.NotPanics(func() {
		sqlParse(newLexer(simpleStarSelect))
	})
	assert.False(parseResult.Extended)
	assert.Equal([]string{"*"}, parseResult.fields)
}

Gives

--- FAIL: TestSimpleSelect (0.00s)
    parser_test.go:44:
        	Error Trace:	parser_test.go:44
        	Error:      	Not equal:
        	            	expected: []string{"*"}
        	            	actual  : []string(nil)

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,4 +1,2 @@
        	            	-([]string) (len=1) {
        	            	- (string) (len=1) "*"
        	            	-}
        	            	+([]string) <nil>

        	Test:       	TestSimpleSelect

[Design] Pipeline-like syntax

I am wondering if we could use a pipeline-like syntax. Parsing would be much easier in this case, and the transformation of the data also looks more natural.

select * from my_table | Normalize | Train DNN

The extended SQL syntax for training models using TensorFlow Estimators

Comments are very welcome!

A rough idea in my mind is something like this:

SELECT 
  reviewed_code_lines, 
  contributed_code_lines, 
  performance_eval_level 
FROM employees 
TRAIN DNNClassifier 
PARAMS 
  hidden_units=[10, 10], 
  n_classes=3 
FEATURES 
  reviewed_code_lines, 
  contributed_code_lines, 
  CROSS(reviewed_code_lines, contributed_code_lines)
LABEL
  performance_eval_level
INTO auto_performance_evaluator;

I have a plan to write a parser using flex/bison to parse the above SQL statement with extended syntax. The parser should generate a TensorFlow estimator program similar to that described in this tutorial, but using a MySQLDataset operator, instead of the TextLineDataset operator.

A key challenge here is how to specify the crossed features, which was described in this document.

Another challenge is how we could save the trained model into a table, e.g., auto_performance_evaluator in the above example.

Missing Column in Prediction Clause

The verifier uses the COLUMN and LABEL fields to find data types.

SELECT MonthlyCharges, TotalCharges, tenure
FROM churn.churn
TRAIN DNNClassifier
WITH 
  n_classes = 73,
  hidden_units = [10, 20]
COLUMN MonthlyCharges, TotalCharges
LABEL tenure
INTO my_dnn_model;

However, there is no COLUMN field in the prediction clause

SELECT MonthlyCharges, TotalCharges
FROM churn.churn
PREDICT churn.predict.tenure
USING my_dnn_model;

So how should the verifier get the data types?
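
One possible approach, purely as an assumption and not the project's decided design, is to record the training columns and their types next to the saved model, so the verifier can read them back in the prediction phase instead of requiring a COLUMN clause:

// savedModelMeta is a hypothetical record gob-encoded alongside the model archive.
type savedModelMeta struct {
	Estimator  string
	FieldTypes map[string]string // column name -> SQL field type captured at training time
	Label      string
}

// At training time: encode savedModelMeta together with the model.
// At prediction time: decode it and check the prediction SELECT's fields against FieldTypes.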

Thinking about column type inference from SQL field types

The current ad-hoc solution is to use a map

https://github.com/wangkuiyi/sqlflow/blob/80e1fe77eb5c6e8ed1c0ffd3869671e8574ae7de/sql/codegen.go#L15

It might be too weak to support any case other than (1) each field is a feature, and (2) all fields are of float type.

For example, users might write the following statement which operates on a single string-typed field f:

SELECT f
FROM table
COLUMN f, hash(f, 100), cross(f, hash(f, 100))

In our Estimator code, the feature list needs three elements:

[
   tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 
   tf.feature_column.categorical_column_with_hash_bucket(
       tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 100),
   tf.feature_column.cross_column(
       tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 
       tf.feature_column.categorical_column_with_hash_bucket(
           tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 100))
]

Add sql/verifier.go

After the lexer and parser, we need a verifier, which connects to the SQL engine, runs DESCRIBE table_name to retrieve fields and field types, makes sure that columns are derived from existing columns, and infers the feature column types from the field types.
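
A minimal sketch of the DESCRIBE step, assuming a database/sql handle and a hypothetical fieldTypes map; the names are illustrative, not the project's actual API:

package sql

import (
	"database/sql"
	"fmt"
)

type fieldTypes map[string]string

// describeTable runs `DESCRIBE table_name` and collects field name -> SQL type.
func describeTable(db *sql.DB, table string) (fieldTypes, error) {
	rows, err := db.Query(fmt.Sprintf("DESCRIBE %s", table))
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	fts := fieldTypes{}
	for rows.Next() {
		// MySQL's DESCRIBE returns: Field, Type, Null, Key, Default, Extra.
		var field, typ string
		var null, key, def, extra sql.NullString
		if err := rows.Scan(&field, &typ, &null, &key, &def, &extra); err != nil {
			return nil, err
		}
		fts[field] = typ
	}
	return fts, rows.Err()
}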

Test training in codegen_test.go

Test Training Program Autogen

  • Go can run a process while capturing its stderr and stdout. We want to test that the auto-generated TensorFlow/Python program can train the model, so we need to run a Docker container using the above technique, with a command like
docker run --rm -it -v $PWD:/work -w /work tensorflow/tensorflow:1.12 python to_be_tested.py

However, we don't want to save the auto-generated file into the filesystem; instead, we want to pipe it to python running in the TensorFlow container. So we can do

echo "print(1)" | docker run --rm -i tensorflow/tensorflow python

To run the above bash program in a command line, we need to run

sh -c 'echo "print(1)" | docker run --rm -i tensorflow/tensorflow python'

We can do this by following ExampleCmd_CombinedOutput in https://golang.org/src/os/exec/example_test.go

However, the echo "print(1)" in our case is something returned by a Go function in our driving program, not a standalone program. To pipe something to a process, we need to

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	r := strings.NewReader("print(1)")
	cmd := exec.Command("docker", "run", "--rm", "-i", "tensorflow/tensorflow", "python")
	cmd.Stdin = r
	o, _ := cmd.CombinedOutput()
	fmt.Println(string(o))
}

Get familiar with TensorFlow Estimators

  1. How to read inputs.
  2. How to convert inputs into features.
  3. How to train the model.
  4. How to save the parameters.
  5. How to reload the parameters for more training iterations and/or for inference.
  6. How to save the model into a SQL system.

Add sql/codegen.go

We need to design the code generation algorithm, particularly how to map the fieldTypes []string returned by sql/verifier.go to tf.feature_column.* calls.
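
A minimal sketch of one possible mapping, assuming every selected field becomes exactly one feature column and that the field types have already been normalized by the verifier; the table and function below are illustrative, not the project's actual rules:

package sql

import (
	"fmt"
	"strings"
)

// featureColumnTemplate maps a SQL field type to a Python feature-column snippet
// that will be substituted into the generated program's template.
var featureColumnTemplate = map[string]string{
	"FLOAT":   `tf.feature_column.numeric_column("%s")`,
	"DOUBLE":  `tf.feature_column.numeric_column("%s")`,
	"INT":     `tf.feature_column.numeric_column("%s")`,
	"VARCHAR": `tf.feature_column.categorical_column_with_hash_bucket("%s", 100)`,
}

func featureColumnCode(field, sqlType string) (string, error) {
	tmpl, ok := featureColumnTemplate[strings.ToUpper(sqlType)]
	if !ok {
		return "", fmt.Errorf("unsupported field type %s for column %s", sqlType, field)
	}
	return fmt.Sprintf(tmpl, field), nil
}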

Unable to untar the model

func (m *model) load(cfg *mysql.Config, cwd string) (e error) {
	db, e := sql.Open("mysql", cfg.FormatDSN())
	if e != nil {
		return e
	}
	defer db.Close()

	sqlfn := fmt.Sprintf("sqlflow_models.%s", m.parseResult.model)
	sqlf, e := sqlfs.Open(db, sqlfn)
	if e != nil {
		return fmt.Errorf("Cannot open sqlfs file %s: %v", sqlfn, e)
	}
	defer func() { sqlf.Close() }()

	if e := gob.NewDecoder(sqlf).Decode(m); e != nil {
		return fmt.Errorf("model.load: gob-decoding model failed: %v", e)
	}

	dir := cwd
	cmd := exec.Command("tar", "Pxzf", "-", "-C", dir)
	cmd.Stdin = sqlf
	return cmd.Run()
}

Gives

tar: Unrecognized archive format
tar: Error exit delayed from previous errors.

Save predicted Y to MySQL

The following SQL statement will save the predicted class into iris.prediction_table.class:

SELECT *
FROM iris.iris
PREDICT iris.prediction_table.class
USING my_dnn_model;

Logic:

  1. If a selected field's name or type differs from the training phase, it should raise an error.
  2. If iris.prediction_table doesn't exist, we should create it, and we should figure out the column type of class. If iris.prediction_table already exists, it will be overwritten.

This logic should be implemented in Go; a minimal sketch of step 2 follows.
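
A sketch of preparing the prediction table with database/sql, assuming the label column's type was recorded at training time and that feature columns are floats for simplicity; the function and parameter names are illustrative:

package sql

import (
	"database/sql"
	"fmt"
	"strings"
)

// preparePredictionTable drops and recreates the prediction table so that an
// existing table is overwritten, as described in step 2 above.
func preparePredictionTable(db *sql.DB, table, labelCol, labelType string, featureCols []string) error {
	if _, err := db.Exec(fmt.Sprintf("DROP TABLE IF EXISTS %s", table)); err != nil {
		return err
	}
	cols := []string{}
	for _, c := range featureCols {
		cols = append(cols, fmt.Sprintf("%s FLOAT", c)) // assumes float-typed features
	}
	cols = append(cols, fmt.Sprintf("%s %s", labelCol, labelType))
	_, err := db.Exec(fmt.Sprintf("CREATE TABLE %s (%s)", table, strings.Join(cols, ", ")))
	return err
}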

Python unable to connect to MySQL Server

db = mysql.connector.connect(user="root",
                             passwd="root",
                             host="localhost:3306")

Gives

mysql.connector.errors.DatabaseError: 2005 (HY000): Unknown MySQL server host 'localhost:3306' (2)

However, if we remove port 3306 from the host string, it connects successfully (the port should instead be passed via the separate port argument):

db = mysql.connector.connect(user="root",
                             passwd="root",
                             host="localhost")

Refactor the API of codegen.go

There should be a function named something like codegen or generateCode in codegen.go, like parse in parser.go and verify in verifier.go.

Add predict in sql/codegen.go

In the evaluation phase, generateTFProgram should know four things (a sketch of how they might be threaded through follows the lists below):

  1. parsedResult from parser.go. It contains the standard select.
  2. savedModel from the MySQL database. It contains the estimator's config.
  3. fieldTypes from verifier.go. It contains the columns and column types.
  4. mysql.Config. It contains the username, password, etc.

The generated TF Program should do the following

  1. Load X from MySQL
  2. Load model from working directory
  3. Predict Y based on X
  4. Save predicted Y to MySQL(#115)
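
A minimal sketch of how these four inputs might be threaded into a prediction code generator; the helper and template names are hypothetical, not the current API:

package sql

import (
	"io"

	"github.com/go-sql-driver/mysql"
)

// generateTFPredictProgram renders the Python prediction program from the four
// inputs listed above: the parsed SELECT, the saved model, the field types, and
// the connection config.
func generateTFPredictProgram(w io.Writer, pr *extendedSelect, m *model,
	fts fieldTypes, cfg *mysql.Config) error {
	filler, err := newPredictTemplateFiller(pr, m, fts, cfg) // hypothetical helper
	if err != nil {
		return err
	}
	return predictTemplate.Execute(w, filler) // a text/template holding the Python program
}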

Move the Python template out of codegen.go into standalone, runnable, and testable .py files

For example, we can copy code out from codegen.go into sql/python/fetch_data.py like

import tensorflow as tf
import mysql.connector

def sql_connect(user, passwd, host, port, database):
    if not database:
        return mysql.connector.connect(user=user,
                                       passwd=passwd,
                                       host=host,
                                       port=port)
    else:
        return mysql.connector.connect(user=user,
                                       passwd=passwd,
                                       host=host,
                                       port=port,
                                       database=database)

def fetch_data(user, passwd, host, port, database, slctStmt):
    cursor = sql_connect(user, passwd, host, port, database).cursor()
    cursor.execute(slctStmt)
    field_names = [i[0] for i in cursor.description]
    columns = list(map(list, zip(*cursor.fetchall())))
    return field_names, columns

def slice_feature_and_label(field_names, columns, feature_types, feature_names, label_name):
    feature_columns = [
        getattr(tf.feature_column, feature_types[i])(
            key=feature_names[i]) for i in range(len(feature_types))]
    feature_column_names = [
        feature_names[i] for i in range(len(feature_types))]

    X = {nm: columns[field_names.index(nm)] for nm in feature_column_names}
    Y = columns[field_names.index(label_name)]

    return X, Y

and we can have sql/python/test_fetch_data.py like

import unittest
import fetch_data

class TestFetchData(unittest.TestCase):

    def setUp(self):
        self.user = 'root'
        self.passwd = 'root'
        self.host = 'localhost'
        self.port = 3306
        self.database = ''

    def test_sql_connect(self):
        self.assertIsNotNone(fetch_data.sql_connect(
            self.user, self.passwd, self.host, self.port, self.database))

if __name__ == '__main__':
    unittest.main()

And codegen.go could execute the template to generate only a Python __main__ block which calls the above testable Python functions:

if __name__ == "__main__":
    fetch_data(user={{.User}}, passwd={{.Passwd}}, ...
    ....

Possibly remove type connectionConfig in codegen.go

The contents of connectionConfig overlap with those in mysql.Config.

A consequence is that we must make sure that the contents of both structs are consistent with each other; for example, the following code

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/codegen_test.go#L43-L46

reveals that the content in connectionConfig

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/codegen_test.go#L30-L35

must be consistent with those in mysql.Config

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/db_test.go#L17-L21

It looks to me that we can remove the definition of connectionConfig and change the content of TemplateFiller

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/codegen.go#L22-L34

into using mysql.Config and a WorkDir string field:

type TemplateFiller struct {
   ...
   mysql.Config
   WorkDir string
}

And, change the signature

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/codegen.go#L36

into

func NewTemplateFiller(pr *extendedSelect, fts fieldTypes, cfg *mysql.Config, workdir string) (*TemplateFiller, bool)

SQLFlow in Jupyter Notebook

In order to make the first milestone, we plan to allow users to run the syntax-extended SQL statements in the Jupyter Notebook.

Add predict in sql/executor.go

A prediction job needs to do the following

  1. Download the model files from the MySQL database to the working directory
  2. Verify the infer clause
  3. Prepare prediction table
  4. Predict
