wh1isper / sparglim

Sparglim ✨ makes PySpark apps configurable and deploying a Spark Connect Server easier!

License: BSD 3-Clause "New" or "Revised" License

Python 99.67% Shell 0.33%
jupyter-magic pyspark spark spark-on-kubernetes spark-connect-server spark-connect spark-sql

sparglim's Introduction

Sparglim ✨

Sparglim aims to provide a clean solution for PySpark applications in cloud-native scenarios (on K8s, Spark Connect Server, etc.).

This is a fledgling project; PRs, feature requests, and discussions are all welcome!

🌟✨⭐ Star to support!

Quick Start

Run JupyterLab with the sparglim Docker image:

docker run \
-it \
-p 8888:8888 \
wh1isper/jupyterlab-sparglim

Access http://localhost:8888 in your browser to use JupyterLab with sparglim. Then you can try the SQL magic.
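For example, in a notebook cell (the magic is covered in the SQL Magic section below):

%load_ext sparglim.sql
%sql SELECT 1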

Run and daemonize a Spark Connect Server:

docker run \
-it \
-p 15002:15002 \
-p 4040:4040 \
wh1isper/sparglim-server

Access http://localhost:4040 for the Spark UI and sc://localhost:15002 for the Spark Connect Server. Use sparglim to set up a SparkSession connected to the Spark Connect Server.
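A minimal connection sketch (the full example is under "Connect to Spark Connect Server" below):

import os
os.environ["SPARGLIM_REMOTE"] = "sc://localhost:15002"

from sparglim.config.builder import ConfigBuilder

spark = ConfigBuilder().config_connect_client().get_or_create()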

Install: pip install sparglim[all]

  • Install only the config module and the Spark Connect Server daemon: pip install sparglim
  • Install for PySpark apps: pip install sparglim[pyspark]
  • Install for using the magic in IPython/Jupyter (also installs PySpark): pip install sparglim[magic]
  • Install everything above (e.g. for using the magic in JupyterLab on K8s): pip install sparglim[all]
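Note: in zsh, quote the extras so the brackets are not glob-expanded:

pip install 'sparglim[all]'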

Features

  • Configure Spark via environment variables
  • %SQL and %%SQL magic for executing Spark SQL in IPython/Jupyter
    • SQL statements can span multiple lines; use ; to separate statements
    • Supports configuring the Connect client, see Spark Connect Overview
    • TODO: visualize the result of a SQL statement (a Spark DataFrame)
  • sparglim-server for daemonizing a Spark Connect Server

Use cases

Basic

from sparglim.config.builder import ConfigBuilder
from datetime import datetime, date
from pyspark.sql import Row

# Create a local[*] SparkSession; S3 & Kerberos config is picked up from environment variables
spark = ConfigBuilder().get_or_create()

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()
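Expected output:

+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
|  4|5.0|string3|2000-03-01|2000-01-03 12:00:00|
+---+---+-------+----------+-------------------+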

Building a PySpark App

To configure Spark on K8s for data exploration, see examples/jupyter-sparglim-on-k8s.

To configure Spark for an ELT application/service, see the pyspark-sampling project.

Deploy Spark Connect Server on K8S (And Connect to it)

To daemonize a Spark Connect Server on K8s, see examples/sparglim-server.

To daemonize a Spark Connect Server on K8s and connect to it from JupyterLab, see examples/jupyter-sparglim-sc.

Connect to Spark Connect Server

The only thing you need to do is set the SPARGLIM_REMOTE environment variable; the format is sc://host:port.

Example Code:

import os
os.environ["SPARGLIM_REMOTE"] = "sc://localhost:15002" # or export SPARGLIM_REMOTE=sc://localhost:15002 before run python

from sparglim.config.builder import ConfigBuilder
from datetime import datetime, date
from pyspark.sql import Row


c = ConfigBuilder().config_connect_client()
spark = c.get_or_create()

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()

SQL Magic

Install Sparglim with

pip install sparglim["magic"]

Load magic in IPython/Jupyter

%load_ext sparglim.sql
spark  # show brief SparkSession info

Create a view:

from datetime import datetime, date
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.createOrReplaceTempView("tb")

Query the view with %sql:

%sql SELECT * FROM tb

The %sql result DataFrame can be assigned to a variable:

df = %sql SELECT * FROM tb
df

or use %%sql to execute multi-line statements:

%%sql SELECT
        *
        FROM
        tb;

You can also use Spark SQL to load data from an external data source, for example:

%%sql CREATE TABLE tb_people
USING json
OPTIONS (path "/path/to/file.json");
SHOW TABLES;
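Then query the new table just like the temp view above:

%sql SELECT * FROM tb_people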

Develop

Install pre-commit before committing:

pip install pre-commit
pre-commit install

Install the package locally:

pip install -e .[test]

Run the unit tests before opening a PR, and ensure new features are covered by tests:

pytest -v

(Optional, Python <= 3.10) Use pytype to check types:

pytype ./sparglim

sparglim's People

Contributors

pre-commit-ci[bot], wh1isper


sparglim's Issues

Unable to find S3 credentials when connecting to S3-compatible storage

  • sparglim version: the latest version
  • Python version: using the Jupyter Notebook provided in jupyter-sparglim-sc
  • Operating System: Ubuntu 22.04, minikube version v1.31.2

Description

Unable to find S3 credentials when connecting to S3-compatible storage (e.g. MinIO).

What I Did

I successfully executed a Spark plan with Spark Connect and a Jupyter Notebook, installed by following the guide at https://github.com/Wh1isper/sparglim/tree/main/examples/jupyter-sparglim-sc/k8s.

Next, I wanted to test its ability to connect to S3-compatible storage such as MinIO, but it failed with the error "org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider".

I have successfully tested the connection to my S3-compatible storage using local pyspark with the same configurations. Additionally, I have conducted network testing inside Minikube and found no network blockage to my S3 storage.

It seems the access key and secret key provided in the Spark conf were not used, but I am not sure of the root cause.

Full error log

(screenshot not included)

The code for connecting to S3:

from pyspark.sql import SparkSession
from pyspark import SparkConf

# Set Spark Connection Config

conf = SparkConf()
# ----------------------------- Spark Connect ----------------------------- #
conf.set("spark.remote", "sc://192.168.49.1:30052")
conf.set("spark.app.name", "<Spark App Name>")
# ----------------------------- S3 --------------------------------------- #
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
conf.set("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")
conf.set("spark.hadoop.fs.s3a.endpoint", "https://<My S3 Endpoint>.com:10443")
conf.set("spark.hadoop.fs.s3a.access.key", "<My S3 Access Key>")
conf.set("spark.hadoop.fs.s3a.secret.key", "<My S3 Secret Key>")

spark = SparkSession.builder.config(conf=conf).getOrCreate()

df = spark.read.text("s3a://<S3 Path of my file>/test.txt")
df.show()

Fail to execute the Spark plan on K8s through Spark Connect

  • sparglim version: latest version
  • Python version: same as the latest JupyterLab image in the repo
  • Operating System: Ubuntu 22.04, minikube version v1.31.2

Description

Followed the guide at https://github.com/Wh1isper/sparglim/tree/main/examples/jupyter-sparglim-sc to deploy JupyterLab and Spark Connect on K8s, but Spark failed to execute the Spark plan.

What I Did

Installed K8s through minikube, then did the same as the guide:

kubectl create clusterrolebinding serviceaccounts-cluster-admin \
  --clusterrole=cluster-admin \
  --group=system:serviceaccounts
kubectl apply -f examples/jupyter-sparglim-sc/k8s/jupyter-sparglim/
kubectl apply -f examples/jupyter-sparglim-sc/k8s/sparglim-server/

After doing so, I was able to access JupyterLab and use it to connect to the Spark driver through Spark Connect. But when I ran the sample code given, Spark failed to execute the plan and returned the error "java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD". Meanwhile, the Spark UI showed that the plan failed with the same error.

The log with the returned error:

(screenshot not included)

Sample code for testing

import os
os.environ["SPARGLIM_REMOTE"] = "sc://xxxx:30052" # or export SPARGLIM_REMOTE=sc://localhost:15002 before run python

from sparglim.config.builder import ConfigBuilder
from datetime import datetime, date
from pyspark.sql import Row


c = ConfigBuilder().config_connect_client()
spark = c.get_or_create()

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()

Improve CI

  • Switch the Docker build to a two-stage build so we don't need separate dev Dockerfiles
  • Auto-release the Docker image
  • Self-release support

How to use sparglim-server to connect to the Hive metastore

How can I use sparglim-server to connect to the Hive metastore?

Problem

I use sparglim-server, started with the Hive config, but it can only connect to HDFS; it does not use the Hive metastore.

At the same time, when we run sparglim on K8s, it works fine.

Proposed Solution

How do I configure sparglim so that the Hive metastore works, e.g. spark.sql("SELECT * FROM stg.xxx").show()?
