Coder Social home page Coder Social logo

josephmachado / analytical_dp_with_sql Goto Github PK

View Code? Open in Web Editor NEW
38.0 2.0 15.0 405 KB

Code for my "Efficient Data Processing in SQL" book.

Home Page: https://josephmachado.gumroad.com/l/analyticalsql

Makefile 100.00%
big-data olap sql trino

analytical_dp_with_sql's Introduction

Setup

Please install the following software:

  1. git version >= 2.37.1
  2. Docker version >= 20.10.17 and Docker compose v2 version >= v2.10.2. Make sure that docker is running using docker ps.

Windows users: please setup WSL and a local Ubuntu Virtual machine following the instructions here. Install the above prerequisites on your ubuntu terminal; if you have trouble installing docker, follow the steps here (only Step 1 is necessary). Please install the make command with sudo apt install make -y (if its not already present).

All the commands shown below are to be run via the terminal (use the Ubuntu terminal for WSL users). We will use docker to set up our containers. Clone and move into the lab repository, as shown below.

git clone \
https://github.com/josephmachado/analytical_dp_with_sql.git 
cd analytical_dp_with_sql

Note: If you are using Macbook M1, please follow the instructions here to use the appropriate docker image.

We have some helpful make commands to make working with our systems more accessible. Shown below are the make commands and their definitions

  1. make up: Spin up the docker containers.
  2. make trino: Open trino cli; Use exit to quit the cli. This is where you will type your SQL queries.
  3. make down: Stop the docker containers.

You can see the commands in this Makefile. If your terminal does not support make commands, please use the commands in the Makefile directly. All the commands in this book assume that you have the docker containers running.

In your terminal, do the following:

# Make sure docker is running using docker ps
make up # starts the docker containers
sleep 60 # wait 1 minute for all the containers to set up
make trino # opens the trino cli

In Trino, we can connect to multiple databases (called catalogs in Trino). TPC-H is a dataset used to benchmark analytical database performance. Trino's tpch catalog comes with preloaded tpch datasets of different sizes tiny, sf1, sf100, sf100, sf300, and so on, where sf = scaling factor.

-- run "make trino" or 
-- "docker container exec -it trino-coordinator trino" 
-- to open trino cli

USE tpch.tiny;
SHOW tables;
SELECT * FROM orders LIMIT 5;
-- shows five rows, press q to quit the interactive results screen
exit -- quit the cli

Note: Run make trino or docker container exec -it trino-coordinator trino on your terminal to open the trino cli. The SQL code shown throughout the book assumes you are running it in trino cli.

Starting the docker containers will also start Minio(S3 alternative); we will use Minio as our data store to explain efficient data storage.

UI: Open the Trino UI at http://localhost:8080 (username: any word) and Minio (S3 alternative) at http://localhost:9001 (username: minio, password: minio123) in a browser of your choice.

If you prefer to connect to Trino via a SQL IDE, download DBeaver (addresses issues mentioned here). Open DBeaver,

  1. Click on Database -> New Database Connection
  2. A Connect to a database box will open; search for, select Trino, and press Next.
  3. Do not change settings; use user as the user name. You will get a connected text box if you test the connection. Click Finish, and you will be able to explore our Trino database.

DBeaver

Data Model

The TPC-H data represents a car parts seller's data warehouse, where we record orders, items that make up that order (lineitem), supplier, customer, part(parts sold), region, nation, and partsupp.

Note: Have a copy of the data model as you follow along; this will help with the examples provided and answering exercise questions.

TPC-H data model

Acknowledgments

We use the TPC-H dataset and Trino as our OLAP DB.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.