
pyspark's Introduction

Apache PySpark in Docker

A PySpark Docker container based on OpenJDK and Miniconda 3. PySpark 3+ uses OpenJDK 11; PySpark 2 uses OpenJDK 8.

Running the container

By default, spark-submit --help is run:

docker run godatadriven/pyspark 

To run your own job, make the job accessible through a volume and pass the necessary arguments:

docker run -v /local_folder:/job godatadriven/pyspark [options] /job/<python file> [app arguments]

Samples

The samples folder contains some PySpark jobs that show how to obtain a Spark session and crunch some data. The current directory is mapped as /job, so run the docker command from the root directory of this project.

# Self word counter:
docker run -v $(pwd):/job godatadriven/pyspark /job/samples/word_counter.py

# Self word counter with extra Spark options
docker run -v $(pwd):/job godatadriven/pyspark \
	--name "I count myself" \
	--master "local[1]" \
	--conf "spark.ui.showConsoleProgress=True" \
	--conf "spark.ui.enabled=False" \
	/job/samples/word_counter.py "jobSampleArgument1"
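For reference, a self-word-counting job along the lines of samples/word_counter.py could look roughly like this (a hedged sketch; function and variable names are assumptions, not the repository's actual code):

```python
import re
from collections import Counter


def tokenize(line):
    """Split a line into lowercase word tokens."""
    return re.findall(r"[a-z']+", line.lower())


def count_words_local(lines):
    """Pure-Python fallback: count tokens across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


def main():
    # Spark wiring: read this very file and count its own words.
    # The import lives here so the helpers above work without Spark installed.
    from pyspark.sql import SparkSession  # available inside the container

    spark = SparkSession.builder.appName("word_counter").getOrCreate()
    lines = spark.sparkContext.textFile(__file__)
    counts = (lines.flatMap(tokenize)
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    for word, n in counts.collect():
        print(word, n)
    spark.stop()


if __name__ == "__main__":
    main()
```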

pyspark's People

Contributors: abij, barend, dandandan, dpgrev, jcshoekstra, krisgeus, nielszeilemaker


pyspark's Issues

Temporary AWS credential problem

Hello, this is more a question than an issue: is it possible to update the Hadoop version?

My idea is to use the image in an AWS ECS task to transform S3 data. I've managed to add boto3 into the image and I can obtain all kinds of credentials, but to use them in PySpark I have to configure Hadoop like this:

hadoop_cfg.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_cfg.set("fs.s3a.session.token", credentials.token)
hadoop_cfg.set("fs.s3a.access.key", credentials.access_key)
hadoop_cfg.set("fs.s3a.secret.key", credentials.secret_key)

After a long search I found that this only works on Hadoop 2.8+ (Spark and Hadoop are really bad at managing dependencies; one only works with the exact matching version of the other). So the question is whether the Hadoop version could be made a parameter or an environment variable for cases like this.
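Assuming a Hadoop-AWS build at 2.8 or newer, the wiring described above could be sketched like this (function names and overall structure are illustrative, not part of this image):

```python
def s3a_temporary_credential_settings(access_key, secret_key, token):
    """Return the Hadoop s3a settings needed for temporary (STS) credentials,
    e.g. ones obtained via boto3 inside an ECS task."""
    return {
        # TemporaryAWSCredentialsProvider requires Hadoop-AWS 2.8+
        "fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.session.token": token,
    }


def apply_settings(spark, settings):
    """Apply the settings to the JVM-side Hadoop Configuration."""
    hadoop_cfg = spark.sparkContext._jsc.hadoopConfiguration()
    for key, value in settings.items():
        hadoop_cfg.set(key, value)
```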

Update to Debian 11 leaves Microsoft SQL users stranded

The OpenJDK base image has upgraded to the fresh new Debian 11 Bullseye. As of today the Microsoft SQL ODBC driver has not yet been released for Bullseye (1). This leaves Azure clients somewhat stranded when SQLServer pops up in their workloads.

For a short term fix, it might be enough to tack -buster onto the end of the OpenJDK base image tag in Dockerfile until Microsoft releases their ODBC driver for Bullseye.
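A minimal sketch of that short-term fix (the exact base image name and tag are assumptions; check the actual Dockerfile for the real ones):

```dockerfile
# Pin the OpenJDK base image to Buster until Microsoft ships
# the ODBC driver for Bullseye. Tag is illustrative.
FROM openjdk:11-slim-buster
```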

Cannot submit python script as spark job

Hello, I am following the command docker run -v $(pwd):/job godatadriven/pyspark /job/samples/word_counter.py with my own python script and am getting this error:
Error: No main class set in JAR; please specify one with --class

In the Spark documentation they say:
For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR,

This is what I'm doing - why am I getting this error?
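One thing worth double-checking (an assumption, not a confirmed diagnosis): spark-submit treats its first non-option argument as the primary resource, so all Spark options must come before the script path, and the mounted path must actually resolve inside the container, otherwise spark-submit falls back to expecting a JAR. Paths here are illustrative:

```shell
# Options first, then the .py file, then app arguments
docker run -v "$(pwd)":/job godatadriven/pyspark \
    --master "local[1]" \
    /job/my_script.py arg1
```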

Create a Spark image using the binary spark distribution tars

Currently the Spark distribution / Hadoop libs in the image are installed using conda / pip, which has a few implications.

  • Because pip is being used some parts of the distribution are being left out (such as a start-thriftserver.sh script)
  • The location of the distribution is a weird one, as it's within the conda directory (/opt/miniconda3/lib/python3.8/site-packages/pyspark)

Other findings:

  • variables like SPARK_HOME aren't set
  • Root user is being used
  • Could be using a multi-stage build to reduce image size and to avoid uninstalling dependencies in the Dockerfile

It might also be an idea to use a Spark base image, like https://github.com/bitnami/bitnami-docker-spark, which improves on all of these points.
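As a starting point, a multi-stage build based on the official binary distribution might look like this (versions, image tags, and paths are assumptions for illustration, not a tested Dockerfile):

```dockerfile
# Stage 1: fetch and unpack the official Spark binary distribution
FROM debian:bullseye-slim AS downloader
ARG SPARK_VERSION=3.1.2
ARG HADOOP_VERSION=3.2
RUN apt-get update && apt-get install -y curl \
 && curl -fsSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
    | tar -xz -C /opt \
 && mv "/opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" /opt/spark

# Stage 2: slim runtime image; the full distribution (including scripts
# like start-thriftserver.sh) lands in a conventional SPARK_HOME
FROM openjdk:11-jre-slim
COPY --from=downloader /opt/spark /opt/spark
ENV SPARK_HOME=/opt/spark \
    PATH="/opt/spark/bin:${PATH}"
# Run as a non-root user instead of root
RUN useradd --create-home spark
USER spark
ENTRYPOINT ["/opt/spark/bin/spark-submit"]
```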
