sql-machine-learning / sqlflow


Brings SQL and AI together.

Home Page: https://sqlflow.org

License: Apache License 2.0

Dockerfile 0.65% Go 50.02% Yacc 0.77% Shell 4.12% Python 40.59% HTML 0.28% Java 2.46% JavaScript 1.10%
sqlflow sql-syntax ai transpiler deep-learning databases machine-learning

sqlflow's Introduction

SQLFlow


What is SQLFlow

SQLFlow is a compiler that compiles a SQL program to a workflow that runs on Kubernetes. The input is a SQL program written in our extended SQL grammar to support AI jobs, including training, prediction, model evaluation, model explanation, custom jobs, and mathematical programming. The output is an Argo workflow that runs on a Kubernetes cluster in a distributed fashion.

SQLFlow supports various database systems such as MySQL, MariaDB, TiDB, Hive, and MaxCompute, and many machine learning toolkits such as TensorFlow, Keras, and XGBoost.

Try SQLFlow NOW in our playground https://playground.sqlflow.tech/ and check out the handy tutorials in it.

Motivation

The current experience of developing ML-based applications requires a team of data engineers, data scientists, and business analysts, as well as a proliferation of advanced languages and programming tools like Python, SQL, SAS, SASS, Julia, and R. The fragmentation of tooling and development environments adds engineering difficulty to model training and tuning. What if we married the most widely used data management/processing language, SQL, with ML/system capabilities and let engineers with SQL skills develop advanced ML-based applications?

There is already some work in progress in the industry. We can write simple machine learning prediction (or scoring) algorithms in SQL using operators like DOT_PRODUCT. However, this requires copy-and-pasting model parameters from the training program into SQL statements. In the commercial world, we see some proprietary SQL engines providing extensions to support machine learning capabilities.

  • Microsoft SQL Server: Microsoft SQL Server has the machine learning service that runs machine learning programs in R or Python as an external script.
  • Teradata SQL for DL: Teradata also provides a RESTful service, which is callable from the extended SQL SELECT syntax.
  • Google BigQuery: Google BigQuery enables machine learning in SQL by introducing the CREATE MODEL statement.

None of the existing solutions addresses our pain points; instead, we want a solution that is fully extensible:

  1. The solution should be compatible with many SQL engines, instead of a specific version or type.
  2. It should support sophisticated machine learning models, including TensorFlow for deep learning and XGBoost for trees.
  3. We also want the flexibility to configure and run cutting-edge ML algorithms, including specifying feature crosses; at the very least, there should be no Python or R code embedded in the SQL statements, and the solution should be fully integrated with hyperparameter estimation.

Quick Overview

Here are examples of training a TensorFlow DNNClassifier model on the sample data iris.train and running prediction with the trained model. You can see how cool it is to write elegant ML code using SQL:

sqlflow> SELECT *
FROM iris.train
TO TRAIN DNNClassifier
WITH model.n_classes = 3, model.hidden_units = [10, 20]
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
INTO sqlflow_models.my_dnn_model;

...
Training set accuracy: 0.96721
Done training
sqlflow> SELECT *
FROM iris.test
TO PREDICT iris.predict.class
USING sqlflow_models.my_dnn_model;

...
Done predicting. Predict table : iris.predict

How to use SQLFlow

Contributing Guidelines

Roadmap

SQLFlow would love to support as many mainstream ML frameworks and data sources as possible, but we feel this expansion would be hard to do merely on our own, so we would love to hear your opinions on which ML frameworks and data sources you are currently using and building upon. Please refer to our roadmap for specific timelines, and also let us know your current scenarios and interests around the SQLFlow project so we can prioritize based on feedback from the community.

Feedback

Your feedback is our motivation to move on. Please let us know your questions, concerns, and issues by filing GitHub Issues.

License

Apache License 2.0


sqlflow's People

Contributors

brightcoder01, dependabot[bot], derek-wds, echo9573, egolearner, ggaaooppeenngg, huangxinv587, hungry1526, joyyoj, lhw362950217, linkerzhang, matrix4284, merlintang, publicwlj, romainr, samplise, shendiaomo, sneaxiy, sperlingxx, terrytangyuan, tonyyang-svail, typhoonzero, uuleon, wangkuiyi, weiguoz, yake0176, yancey1989, yuyicg, ywskycn, zou000


sqlflow's Issues

During inference, how to get model config

In the training phase, the SQL statement contains the model config: DNNClassifier, n_classes, and hidden_units.

SELECT sepal_length, sepal_width, petal_length, petal_width, species
FROM irisis
TRAIN DNNClassifier
WITH
  n_classes = 3,
  hidden_units = [10, 20]
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL species
INTO my_dnn_model;

However, the infer statement doesn't have it

SELECT sepal_length, sepal_width, petal_length, petal_width, species
FROM irisis
INFER my_dnn_model;

Avoid global variable parsedResult

The current parse API uses a global variable parseResult. Is it possible to change it to

parseResult := sqlParse(newLexer("select * ..."))

In the inference phase, we parse two SQL statements: trainSQL and inferSQL. I feel it would be good practice to avoid sharing parseResult.
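
A minimal sketch of what a global-free API could look like, assuming the grammar's top-level action can write its result into the lexer instead of a package-level variable; the field and type names below are illustrative, not the current code:

package sql

import "fmt"

type lexer struct {
	// ... existing scanner state ...
	result *extendedSelect // filled in by the grammar's top-level reduce action
}

func sqlParse(l *lexer) (*extendedSelect, error) {
	// yyParse is the goyacc-generated entry point; a non-zero return means a parse error.
	if ret := yyParse(l); ret != 0 {
		return nil, fmt.Errorf("sqlParse: yyParse returned %d", ret)
	}
	return l.result, nil
}

// Usage in the inference path: two independent results, nothing shared.
// trainParsed, err := sqlParse(newLexer(trainSQL))
// inferParsed, err := sqlParse(newLexer(inferSQL))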

add train phase in executor.go

An executor will execute a training job. It needs to do the following

  1. Before training, generate the training job code using generateTFProgram in codegen.go
  2. Train the model in a Docker container, with the working directory mounted via the -v option (see the sketch after this list)
  3. After training, upload the trained model file to the MySQL database
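
A minimal sketch of step 2, assuming the generated program has been written to train.py in the working directory; the image tag and file name are illustrative assumptions:

package sql

import (
	"fmt"
	"os/exec"
)

// trainInContainer runs the generated training program inside a TensorFlow
// container, mounting the working directory so the model is written back to cwd.
func trainInContainer(cwd string) error {
	cmd := exec.Command("docker", "run", "--rm",
		"-v", fmt.Sprintf("%s:/work", cwd),
		"-w", "/work",
		"tensorflow/tensorflow:1.12", "python", "train.py")
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("training failed: %v\n%s", err, out)
	}
	return nil
}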

Add sql/executor.go

An executor will execute the training/evaluation job. It needs to do the following

  1. Download the model files from the MySQL database, if necessary, to the working directory
  2. Copy the generated Python training file to the working directory
  3. Train the model in a Docker container, with the working directory mounted via the -v option
  4. After the job finishes, upload the trained model file to the MySQL database, if necessary

The remaining work in sqlflow/sql

  • set up the Iris dataset as a MySQL Docker image in example/iris
  • test auto-generated train program
  • test auto-generated evaluate program
  • test auto-generated predict program
  • write SQLFlowRunner and SQLFlowRunnerHandler, whose input is a SQL statement
    1. Parse and translate the SQL statement into a TensorFlow Python training/predicting program.
    2. Call one of the following
      1. Train,
      2. Evaluate, or,
      3. Predict.
  • Jupyter Notebook magic command
  • Set up for single-node regression test.

Missing quotation marks on "columns" and "save" contents

When serializing the parsed SQL statement to JSON, a couple of quotation marks are missing:

{
  "extended": true,
  "train": true,
  "standardSelect": "SELECT employee.age, last_name, salary\n FROMemployee\n WHERE employee.age % 10 < (salary / 10000) AND strings.Upper(last_name) = \"WANG\"\n LIMIT 100;",
  "trainClause": {
    "estimator": "DNNClassifier",
    "attrs": {
      "hidden_units": "[10, 20]",
      "n_classes": "3"
    },
    "columns": [
      employee.name,
      bucketize(last_name, 1000),
      cross(embedding(emplyoee.name), bucketize(last_name, 1000))
    ],
    "save": my_dnn_model
  }
}

make SQL parser support select *

const (
    simpleStarSelect = `
SELECT *
FROM irisis;
`
)

func TestSimpleSelect(t *testing.T) {
	assert := assert.New(t)
	assert.NotPanics(func() {
		sqlParse(newLexer(simpleStarSelect))
	})
	assert.False(parseResult.Extended)
	assert.Equal([]string{"*"}, parseResult.fields)
}

Gives

--- FAIL: TestSimpleSelect (0.00s)
    parser_test.go:44:
        	Error Trace:	parser_test.go:44
        	Error:      	Not equal:
        	            	expected: []string{"*"}
        	            	actual  : []string(nil)

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,4 +1,2 @@
        	            	-([]string) (len=1) {
        	            	- (string) (len=1) "*"
        	            	-}
        	            	+([]string) <nil>

        	Test:       	TestSimpleSelect

[Design] Pipeline-like syntax

I am wondering if we could use a pipeline-like syntax. Parsing would be much easier in this case, and the transformation of the data also looks more natural.

select * from my_table | Normalize | Train DNN

The extended SQL syntax for training models using TensorFlow Estimators

Comments are very welcome!

A rough idea in my mind is something like this:

SELECT 
  reviewed_code_lines, 
  contributed_code_lines, 
  performance_eval_level 
FROM employees 
TRAIN DNNClassifier 
PARAMS 
  hidden_units=[10, 10], 
  n_classes=3 
FEATURES 
  reviewed_code_lines, 
  contributed_code_lines, 
  CROSS(reviewed_code_lines, contributed_code_lines)
LABEL
  performance_eval_level
INTO auto_performance_evaluator;

I have a plan to write a parser using flex/bison to parse the above SQL statement with extended syntax. The parser should generate a TensorFlow estimator program similar to that described in this tutorial, but using a MySQLDataset operator, instead of the TextLineDataset operator.

A key challenge here is how to specify the crossed features, which was described in this document.

Another challenge is how we could save the trained model into a table, e.g., auto_performance_evaluator in the above example.

Missing Column in Prediction Clause

The verifier uses the COLUMN and LABEL fields to find data types.

SELECT MonthlyCharges, TotalCharges, tenure
FROM churn.churn
TRAIN DNNClassifier
WITH 
  n_classes = 73,
  hidden_units = [10, 20]
COLUMN MonthlyCharges, TotalCharges
LABEL tenure
INTO my_dnn_model;

However, there is no COLUMN field in the prediction clause

SELECT MonthlyCharges, TotalCharges
FROM churn.churn
PREDICT churn.predict.tenure
USING my_dnn_model;

So how should the verifier get the data types?
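
One possible approach, purely as an assumption and not the project's decided design, is to record the training columns and their types next to the saved model, so the verifier can read them back in the prediction phase instead of requiring a COLUMN clause:

// savedModelMeta is a hypothetical record gob-encoded alongside the model archive.
type savedModelMeta struct {
	Estimator  string
	FieldTypes map[string]string // column name -> SQL field type captured at training time
	Label      string
}

// At training time: encode savedModelMeta together with the model.
// At prediction time: decode it and check the prediction SELECT's fields against FieldTypes.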

Thinking about column type inference from SQL field types

The current ad-hoc solution is to use a map

https://github.com/wangkuiyi/sqlflow/blob/80e1fe77eb5c6e8ed1c0ffd3869671e8574ae7de/sql/codegen.go#L15

It might be too weak to support any case other than (1) each field is a feature, and (2) all fields are of float type.

For example, users might write the following statement which operates on a single string-typed field f:

SELECT f
FROM table
COLUMN f, hash(f, 100), cross(f, hash(f, 100))

In our Estimator code, the feature list needs three elements:

[
   tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 
   tf.feature_column.categorical_column_with_hash_bucket(
       tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 100),
   tf.feature_column.cross_column(
       tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 
       tf.feature_column.categorical_column_with_hash_bucket(
           tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 100))
]

Add sql/verifier.go

After the lexer and parser, we need a verifier, which connects to the SQL engine, runs DESCRIBE table_name to retrieve fields and field types, makes sure that columns are derived from existing columns, and infers the feature column types from the field types.
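
A minimal sketch of the DESCRIBE step, assuming a database/sql handle and a hypothetical fieldTypes map; the names are illustrative, not the project's actual API:

package sql

import (
	"database/sql"
	"fmt"
)

type fieldTypes map[string]string

// describeTable runs `DESCRIBE table_name` and collects field name -> SQL type.
func describeTable(db *sql.DB, table string) (fieldTypes, error) {
	rows, err := db.Query(fmt.Sprintf("DESCRIBE %s", table))
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	fts := fieldTypes{}
	for rows.Next() {
		// MySQL's DESCRIBE returns: Field, Type, Null, Key, Default, Extra.
		var field, typ string
		var null, key, def, extra sql.NullString
		if err := rows.Scan(&field, &typ, &null, &key, &def, &extra); err != nil {
			return nil, err
		}
		fts[field] = typ
	}
	return fts, rows.Err()
}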

Test training in codegen_test.go

Test Training Program Autogen

  • Go can run a process while capturing its stderr and stdout. We want to test that the auto-generated TensorFlow/Python program can train the model, so we need to run a Docker container using the above technique, with a command like
docker run --rm -it -v $PWD:/work -w /work tensorflow/tensorflow:1.12 python to_be_tested.py

However, we don't want to save the auto-generated file into the filesystem; instead, we want to pipe it to python running in the TensorFlow container. So we can do

echo "print(1)" | docker run --rm -i tensorflow/tensorflow python

To run the above bash program in a command line, we need to run

sh -c 'echo "print(1)" | docker run --rm -i tensorflow/tensorflow python'

We can do this by following ExampleCmd_CombinedOutput in https://golang.org/src/os/exec/example_test.go

However, the echo "print(1)" in our case is something returned by a Go function in our driving program, not a standalone program. To pipe something to a process, we need to

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	r := strings.NewReader("print(1)")
	cmd := exec.Command("docker", "run", "--rm", "-i", "tensorflow/tensorflow", "python")
	cmd.Stdin = r
	o, _ := cmd.CombinedOutput()
	fmt.Println(string(o))
}

Get familiar with TensorFlow Estimators

  1. How to read inputs.
  2. How to convert inputs into features.
  3. How to train the model.
  4. How to save the parameters.
  5. How to reload the parameters for more training iterations and/or for inference.
  6. How to save the model into a SQL system.

Add sql/codegen.go

We need to design the code generation algorithm, particularly how to map the fieldTypes []string returned by sql/verifier.go to tf.feature_column.* calls.
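
A minimal sketch of one possible mapping, assuming every selected field becomes exactly one feature column and that the field types have already been normalized by the verifier; the table and function below are illustrative, not the project's actual rules:

package sql

import (
	"fmt"
	"strings"
)

// featureColumnTemplate maps a SQL field type to a Python feature-column snippet
// that will be substituted into the generated program's template.
var featureColumnTemplate = map[string]string{
	"FLOAT":   `tf.feature_column.numeric_column("%s")`,
	"DOUBLE":  `tf.feature_column.numeric_column("%s")`,
	"INT":     `tf.feature_column.numeric_column("%s")`,
	"VARCHAR": `tf.feature_column.categorical_column_with_hash_bucket("%s", 100)`,
}

func featureColumnCode(field, sqlType string) (string, error) {
	tmpl, ok := featureColumnTemplate[strings.ToUpper(sqlType)]
	if !ok {
		return "", fmt.Errorf("unsupported field type %s for column %s", sqlType, field)
	}
	return fmt.Sprintf(tmpl, field), nil
}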

Unable to untar the model

func (m *model) load(cfg *mysql.Config, cwd string) (e error) {
	db, e := sql.Open("mysql", cfg.FormatDSN())
	if e != nil {
		return e
	}
	defer db.Close()

	sqlfn := fmt.Sprintf("sqlflow_models.%s", m.parseResult.model)
	sqlf, e := sqlfs.Open(db, sqlfn)
	if e != nil {
		return fmt.Errorf("Cannot open sqlfs file %s: %v", sqlfn, e)
	}
	defer func() { sqlf.Close() }()

	if e := gob.NewDecoder(sqlf).Decode(m); e != nil {
		return fmt.Errorf("model.load: gob-decoding model failed: %v", e)
	}

	dir := cwd
	cmd := exec.Command("tar", "Pxzf", "-", "-C", dir)
	cmd.Stdin = sqlf
	return cmd.Run()
}

Gives

tar: Unrecognized archive format
tar: Error exit delayed from previous errors.

Save predicted Y to MySQL

The following SQL statement will save the predicted class into iris.prediction_table.class:

SELECT *
FROM iris.iris
PREDICT iris.prediction_table.class
USING my_dnn_model;

Logic:

  1. If a selected field's name or type differs from the training phase, it should raise an error.
  2. If iris.prediction_table doesn't exist, we should create it, and we should figure out the column type of class. If iris.prediction_table already exists, it will be overwritten.

This logic should be implemented in Go; a minimal sketch of step 2 follows.
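
A sketch of preparing the prediction table with database/sql, assuming the label column's type was recorded at training time and that feature columns are floats for simplicity; the function and parameter names are illustrative:

package sql

import (
	"database/sql"
	"fmt"
	"strings"
)

// preparePredictionTable drops and recreates the prediction table so that an
// existing table is overwritten, as described in step 2 above.
func preparePredictionTable(db *sql.DB, table, labelCol, labelType string, featureCols []string) error {
	if _, err := db.Exec(fmt.Sprintf("DROP TABLE IF EXISTS %s", table)); err != nil {
		return err
	}
	cols := []string{}
	for _, c := range featureCols {
		cols = append(cols, fmt.Sprintf("%s FLOAT", c)) // assumes float-typed features
	}
	cols = append(cols, fmt.Sprintf("%s %s", labelCol, labelType))
	_, err := db.Exec(fmt.Sprintf("CREATE TABLE %s (%s)", table, strings.Join(cols, ", ")))
	return err
}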

Python unable to connect to MySQL Server

db = mysql.connector.connect(user="root",
                             passwd="root",
                             host="localhost:3306")

Gives

mysql.connector.errors.DatabaseError: 2005 (HY000): Unknown MySQL server host 'localhost:3306' (2)

However, if we remove port 3306 from the host string, it connects successfully (the port should instead be passed via the separate port argument):

db = mysql.connector.connect(user="root",
                             passwd="root",
                             host="localhost")

Refactor the API of codegen.go

There should be a function named something like codegen or generateCode in codegen.go, like parse in parser.go and verify in verifier.go.

Add predict in sql/codegen.go

In the evaluation phase, generateTFProgram should know four things (a sketch of how they might be threaded through follows the lists below):

  1. parsedResult from parser.go. It contains the standard select.
  2. savedModel from the MySQL database. It contains the estimator's config.
  3. fieldTypes from verifier.go. It contains the columns and column types.
  4. mysql.Config. It contains the username, password, etc.

The generated TF Program should do the following

  1. Load X from MySQL
  2. Load model from working directory
  3. Predict Y based on X
  4. Save predicted Y to MySQL(#115)
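
A minimal sketch of how these four inputs might be threaded into a prediction code generator; the helper and template names are hypothetical, not the current API:

package sql

import (
	"io"

	"github.com/go-sql-driver/mysql"
)

// generateTFPredictProgram renders the Python prediction program from the four
// inputs listed above: the parsed SELECT, the saved model, the field types, and
// the connection config.
func generateTFPredictProgram(w io.Writer, pr *extendedSelect, m *model,
	fts fieldTypes, cfg *mysql.Config) error {
	filler, err := newPredictTemplateFiller(pr, m, fts, cfg) // hypothetical helper
	if err != nil {
		return err
	}
	return predictTemplate.Execute(w, filler) // a text/template holding the Python program
}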

Move the Python template out of codegen.go into standalone, runnable, and testable .py files

For example, we can copy code out from codegen.go into sql/python/fetch_data.py like

import tensorflow as tf
import mysql.connector

def sql_connect(user, passwd, host, port, database):
    if not database:
        return mysql.connector.connect(user=user,
                                       passwd=passwd,
                                       host=host,
                                       port=port)
    else:
        return mysql.connector.connect(user=user,
                                       passwd=passwd,
                                       host=host,
                                       port=port,
                                       database=database)

def fetch_data(user, passwd, host, port, database, slctStmt):
    cursor = sql_connect(user, passwd, host, port, database).cursor()
    cursor.execute(slctStmt)
    field_names = [i[0] for i in cursor.description]
    columns = list(map(list, zip(*cursor.fetchall())))
    return field_names, columns

def slice_feature_and_label(field_names, columns, feature_types, feature_names, label_name):
    feature_columns = [
        getattr(tf.feature_column, feature_types[i])(
            key=feature_names[i]) for i in range(len(feature_types))]
    feature_column_names = [
        feature_names[i] for i in range(len(feature_types))]

    X = {nm: columns[field_names.index(nm)] for nm in feature_column_names}
    Y = columns[field_names.index(label_name)]

    return X, Y

and we can have sql/python/test_fetch_data.py like

import unittest
import fetch_data

class TestFetchData(unittest.TestCase):

    def setUp(self):
        self.user = 'root'
        self.passwd = 'root'
        self.host = 'localhost'
        self.port = 3306
        self.database = ''

    def test_sql_connect(self):
        self.assertIsNotNone(fetch_data.sql_connect(
            self.user, self.passwd, self.host, self.port, self.database))

if __name__ == '__main__':
    unittest.main()

And codegen.go could execute the template to generate only a Python __main__ block which calls the above testable Python functions:

if __name__ == "__main__":
    fetch_data(user={{.User}}, passwd={{.Passwd}}, ...
    ....

Possibly remove type connectionConfig in codegen.go

The contents of connectionConfig overlap with those in mysql.Config.

A consequence is that we must make sure that the contents of both structs are consistent with each other; for example, the following code

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/codegen_test.go#L43-L46

reveals that the content in connectionConfig

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/codegen_test.go#L30-L35

must be consistent with those in mysql.Config

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/db_test.go#L17-L21

It looks to me that we can remove the definition of connectionConfig and change the content of TemplateFiller

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/codegen.go#L22-L34

into using mysql.Config and a WorkDir string field:

type TemplateFiller struct {
   ...
   mysql.Config
   WorkDir string
}

And, change the signature

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/codegen.go#L36

into

func NewTemplateFiller(pr *extendedSelect, fts fieldTypes, cfg *mysql.Config, workdir string) (*TemplateFiller, bool)

SQLFlow in Jupyter Notebook

In order to make the first milestone, we plan to allow users to run the syntax-extended SQL statements in the Jupyter Notebook.

Add predict in sql/executor.go

A prediction job needs to do the following

  1. Download the model files from the MySQL database to the working directory
  2. Verify the infer clause
  3. Prepare prediction table
  4. Predict
