
pyspark's Introduction

Apache PySpark in Docker

A PySpark Docker container based on OpenJDK and Miniconda 3. PySpark 3+ uses OpenJDK 11; PySpark 2 uses OpenJDK 8.

Running the container

By default, spark-submit --help is run:

docker run godatadriven/pyspark 

To run your own job, make the job accessible through a volume and pass the necessary arguments:

docker run -v /local_folder:/job godatadriven/pyspark [options] /job/<python file> [app arguments]

Samples

The samples folder contains some PySpark jobs that show how to obtain a Spark session and crunch some data. The current directory is mapped as /job, so run the docker command from the root directory of this project.

# Self word counter:
docker run -v $(pwd):/job godatadriven/pyspark /job/samples/word_counter.py

# Self word counter with extra Spark options
docker run -v $(pwd):/job godatadriven/pyspark \
	--name "I count myself" \
	--master "local[1]" \
	--conf "spark.ui.showConsoleProgress=True" \
	--conf "spark.ui.enabled=False" \
	/job/samples/word_counter.py "jobSampleArgument1"
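For reference, a self-word-counting job along the lines of samples/word_counter.py could look roughly like this (a hedged sketch; function and variable names are assumptions, not the repository's actual code):

```python
import re
from collections import Counter


def tokenize(line):
    """Split a line into lowercase word tokens."""
    return re.findall(r"[a-z']+", line.lower())


def count_words_local(lines):
    """Pure-Python fallback: count tokens across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


def main():
    # Spark wiring: read this very file and count its own words.
    # The import lives here so the helpers above work without Spark installed.
    from pyspark.sql import SparkSession  # available inside the container

    spark = SparkSession.builder.appName("word_counter").getOrCreate()
    lines = spark.sparkContext.textFile(__file__)
    counts = (lines.flatMap(tokenize)
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    for word, n in counts.collect():
        print(word, n)
    spark.stop()


if __name__ == "__main__":
    main()
```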

pyspark's People

Contributors: abij, barend, dandandan, dpgrev, jcshoekstra, krisgeus, nielszeilemaker


pyspark's Issues

Temporary AWS credential problem

Hello, this is more a question than an issue: is it possible to update the Hadoop version?

My idea is to use the image in an AWS ECS task to transform S3 data. I've managed to add boto3 into the image and I can obtain all kinds of credentials, but to use them in PySpark I have to configure Hadoop like this:

hadoop_cfg.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_cfg.set("fs.s3a.session.token", credentials.token)
hadoop_cfg.set("fs.s3a.access.key", credentials.access_key)
hadoop_cfg.set("fs.s3a.secret.key", credentials.secret_key)

After a long search I found that this only works on Hadoop 2.8+ (Spark and Hadoop are really bad at managing dependencies; one only works with the exact matching version of the other). So the question is whether the Hadoop version could be made a parameter or an environment variable for cases like this.
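Assuming a Hadoop-AWS build at 2.8 or newer, the wiring described above could be sketched like this (function names and overall structure are illustrative, not part of this image):

```python
def s3a_temporary_credential_settings(access_key, secret_key, token):
    """Return the Hadoop s3a settings needed for temporary (STS) credentials,
    e.g. ones obtained via boto3 inside an ECS task."""
    return {
        # TemporaryAWSCredentialsProvider requires Hadoop-AWS 2.8+
        "fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.session.token": token,
    }


def apply_settings(spark, settings):
    """Apply the settings to the JVM-side Hadoop Configuration."""
    hadoop_cfg = spark.sparkContext._jsc.hadoopConfiguration()
    for key, value in settings.items():
        hadoop_cfg.set(key, value)
```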

Update to Debian 11 leaves Microsoft SQL users stranded

The OpenJDK base image has upgraded to the fresh new Debian 11 Bullseye. As of today the Microsoft SQL ODBC driver has not yet been released for Bullseye (1). This leaves Azure clients somewhat stranded when SQLServer pops up in their workloads.

For a short term fix, it might be enough to tack -buster onto the end of the OpenJDK base image tag in Dockerfile until Microsoft releases their ODBC driver for Bullseye.
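A minimal sketch of that short-term fix (the exact base image name and tag are assumptions; check the actual Dockerfile for the real ones):

```dockerfile
# Pin the OpenJDK base image to Buster until Microsoft ships
# the ODBC driver for Bullseye. Tag is illustrative.
FROM openjdk:11-slim-buster
```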

Cannot submit python script as spark job

Hello, I am following the command docker run -v $(pwd):/job godatadriven/pyspark /job/samples/word_counter.py with my own python script and am getting this error:
Error: No main class set in JAR; please specify one with --class

In the Spark documentation they say:
For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR,

This is what I'm doing - why am I getting this error?
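One thing worth double-checking (an assumption, not a confirmed diagnosis): spark-submit treats its first non-option argument as the primary resource, so all Spark options must come before the script path, and the mounted path must actually resolve inside the container, otherwise spark-submit falls back to expecting a JAR. Paths here are illustrative:

```shell
# Options first, then the .py file, then app arguments
docker run -v "$(pwd)":/job godatadriven/pyspark \
    --master "local[1]" \
    /job/my_script.py arg1
```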

Create a Spark image using the binary spark distribution tars

Currently the Spark distribution / Hadoop libs in the image are installed using conda / pip, which has a few implications.

  • Because pip is being used some parts of the distribution are being left out (such as a start-thriftserver.sh script)
  • The location of the distribution is a weird one, as it's within the conda directory (/opt/miniconda3/lib/python3.8/site-packages/pyspark)

Other findings:

  • variables like SPARK_HOME aren't set
  • Root user is being used
  • Could be using a multi-stage build to reduce image size and to avoid uninstalling dependencies in the Dockerfile

It might also be an idea to use a Spark base image, like https://github.com/bitnami/bitnami-docker-spark, which improves on all of these points.
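As a starting point, a multi-stage build based on the official binary distribution might look like this (versions, image tags, and paths are assumptions for illustration, not a tested Dockerfile):

```dockerfile
# Stage 1: fetch and unpack the official Spark binary distribution
FROM debian:bullseye-slim AS downloader
ARG SPARK_VERSION=3.1.2
ARG HADOOP_VERSION=3.2
RUN apt-get update && apt-get install -y curl \
 && curl -fsSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
    | tar -xz -C /opt \
 && mv "/opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" /opt/spark

# Stage 2: slim runtime image; the full distribution (including scripts
# like start-thriftserver.sh) lands in a conventional SPARK_HOME
FROM openjdk:11-jre-slim
COPY --from=downloader /opt/spark /opt/spark
ENV SPARK_HOME=/opt/spark \
    PATH="/opt/spark/bin:${PATH}"
# Run as a non-root user instead of root
RUN useradd --create-home spark
USER spark
ENTRYPOINT ["/opt/spark/bin/spark-submit"]
```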
