
Deep Learning based Query Optimization for Spark SQL

Existing distributed query systems such as Spark SQL rely on manually crafted rules to select an execution plan, which is often sub-optimal. While recent studies have applied deep learning to query optimization in conventional relational databases, integrating deep learning models with Spark SQL poses system challenges in efficient candidate-plan exploration and real-time deep learning inference. This paper presents DSSO, an end-to-end deep-learning-based query optimization framework that is seamlessly integrated with the native Spark system and practically reduces query execution time. Spark SQL’s core logic is adjusted to expand the plan exploration space, an LSTM-based model is devised to estimate the cost of physical execution plans, and real-time performance inference over candidate plans is supported. Experimental results show that the proposed system achieves over 28% performance improvement on public benchmarks compared to native Spark SQL.

Overview of DSSO

Structure

  • The modified Spark is located at ./spark-3.2.1-modified

  • The deep cost estimation development module is located at ./dsso-dev

  • The deep cost estimation deployment module is located at ./dsso-deploy

  • Scala code for training data generation and end-to-end evaluation is located at ./dsso-test

  • The queries used for DL model development are located at ./data

Usage: DL-enhanced Spark SQL Execution

  • Build the modified Spark
    cd DIR_TO_MODIFIED_SPARK
    ./build/sbt package
  • Run a spark-submit application with cost-estimation-based optimization enabled (example)
bash DIR_TO_MODIFIED_SPARK/bin/spark-submit \
    --class TestXXX \
    --master spark://master:7077 \
    --executor-memory 16g \
    --total-executor-cores 48 \
    --executor-cores 2 \
    --driver-memory 50g \
    --conf spark.sql.autoBroadcastJoinThreshold=8g \
    --conf spark.sql.objectHashAggregate.sortBased.fallbackThreshold=4096 \
    --conf spark.sql.ceo=true \
    --conf spark.sql.ceoDir=xxx/cost-estimation-deploy \
    --conf spark.sql.ceoServerIP=127.0.0.1 \
    --conf spark.sql.ceoPruneAggressive=true \
    --conf spark.sql.ceoMetadataDir=xxx/xx-metadata \
    --conf spark.sql.ceoLengthThreshold=32 \
    xxx.jar inputArgs
Config Explanation

  • spark.sql.autoBroadcastJoinThreshold: Set this native configuration to a large value (8g) to ensure thorough plan exploration.
  • spark.sql.objectHashAggregate.sortBased.fallbackThreshold: Set this native configuration to a large value (4096) to ensure thorough plan exploration.
  • spark.sql.ceo: Set to true to enable cost-estimation-based optimization (default = false).
  • spark.sql.ceoDir: Path to the deployment folder, which contains the trained model and auxiliary files (default = "/").
  • spark.sql.ceoServerIP: The IP of the DL companion server. If the server is localhost, it can be started automatically by SparkSession; otherwise it has to be started manually (default = "127.0.0.1").
  • spark.sql.ceoServerPort: The port of the DL companion server (default = "8308").
  • spark.sql.ceoPruneAggressive: Set to true to enable aggressive pruning (default = false).
  • spark.sql.ceoMetadataDir: Directory storing the metadata files of the tables, preferably on HDFS (default = "").
  • spark.sql.ceoLengthThreshold: The plan-length threshold for enabling cost-estimation-based optimization (default = 500).
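
For reference, the same settings can also be applied programmatically when the session is created. Below is a minimal PySpark sketch, assuming the modified build's Python bindings behave like stock Spark; the spark.sql.ceo* keys come from the table above, while the application name, paths, and query are hypothetical placeholders.

    # Minimal sketch: enabling cost-estimation-based optimization from
    # application code instead of spark-submit flags. The spark.sql.ceo* keys
    # are documented above; app name, paths, and table are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("dsso-example")  # hypothetical application name
        .config("spark.sql.autoBroadcastJoinThreshold", "8g")
        .config("spark.sql.objectHashAggregate.sortBased.fallbackThreshold", "4096")
        .config("spark.sql.ceo", "true")
        .config("spark.sql.ceoDir", "/path/to/cost-estimation-deploy")   # placeholder
        .config("spark.sql.ceoServerIP", "127.0.0.1")
        .config("spark.sql.ceoPruneAggressive", "true")
        .config("spark.sql.ceoMetadataDir", "hdfs:///path/to/metadata")  # placeholder
        .config("spark.sql.ceoLengthThreshold", "32")
        .getOrCreate()
    )

    # Subsequent Spark SQL queries then go through the DL-enhanced optimizer.
    spark.sql("SELECT COUNT(*) FROM some_table").show()  # hypothetical table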

Usage: DL model development

Dependencies (from pip):

torch (PyTorch)
scikit-learn
fse
gensim

TPC-H data generation: https://docs.deistercloud.com/content/Databases.30/TPCH%20Benchmark.90/Data%20generation%20tool.30.xml?embedded=true

  • Data generation

    cd dsso-dev

    RecordQueryTime.scala explains the process of generating training data

  • Model training

    cd dsso-dev

    First run node_embedding_xxx.ipynb, then run lstm_xxx.ipynb (a hedged sketch of the overall idea is given after this list).

    Once the model is trained, move the trained model and the encoding files to ./dsso-deploy.
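
The notebooks are the authoritative pipeline. For orientation only, here is a minimal PyTorch sketch of the general idea, assuming each physical plan is encoded as a sequence of per-operator embedding vectors (the output of the node-embedding step) and the model regresses the recorded execution time; all dimensions, layer choices, and names are illustrative assumptions, not the repository's actual values.

    # Minimal sketch of an LSTM-based cost estimator, NOT the repository's
    # exact architecture. Assumption: each plan is a sequence of fixed-size
    # operator embeddings, and the target is the recorded execution time.
    import torch
    import torch.nn as nn

    class PlanCostModel(nn.Module):
        def __init__(self, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, 1)   # scalar cost estimate

        def forward(self, plans):                  # plans: (batch, seq_len, embed_dim)
            _, (h_n, _) = self.lstm(plans)         # final hidden state summarizes the plan
            return self.head(h_n[-1]).squeeze(-1)  # (batch,) predicted cost

    model = PlanCostModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    plans = torch.randn(32, 10, 64)  # dummy batch: 32 plans, 10 operators each
    times = torch.rand(32)           # dummy recorded execution times

    for _ in range(5):               # a few illustrative training steps
        optimizer.zero_grad()
        loss = loss_fn(model(plans), times)
        loss.backward()
        optimizer.step()

At run time, a model of this kind scores candidate physical plans so that the plan with the lowest estimated cost can be selected.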

Usage: DSSO test

  • cd dsso-test; mkdir lib; cp PATH_TO_MODIFIED_JARS/*.jar lib
  • sbt package
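
The resulting test jar (by default under sbt's target/scala-2.12 directory, since Spark 3.2.1 builds against Scala 2.12, though the exact path depends on the build configuration) can then be run through the modified spark-submit as shown above.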
