cluster-pack

cluster-pack is a library on top of either pex or conda-pack to make your Python code easily available on a cluster.

Its goal is to make your prod/dev Python code and libraries easily available on any cluster. cluster-pack supports HDFS and S3 as distributed storage.

The first examples use Skein (a simple library for deploying applications on Apache YARN) and PySpark with HDFS storage. We intend to add more examples for other applications (such as Dask or Ray) and for S3 storage.

An introductory blog post can be found here.

Installation

Install with Pip

$ pip install cluster-pack

Install from source

$ git clone https://github.com/criteo/cluster-pack
$ cd cluster-pack
$ pip install .

Prerequisites

cluster-pack supports Python ≥3.6.

Features

  • Ships a package with all the dependencies from your current virtual environment or conda environment (see the sketch after this list)

  • Stores metadata for an environment

  • Supports an "under development" mode by taking advantage of pip's editable installs: editable requirements are re-uploaded on every call, making local changes directly visible on the cluster

  • Interactive (Jupyter notebook) mode

  • Provides config helpers to directly use the uploaded zip file inside your application

  • Launches jobs from jobs by propagating all artifacts
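A minimal sketch of the packaging feature, based on the cluster_pack.upload_env() call that appears in the issues below; it packs the active virtualenv or conda env, uploads it to HDFS/S3 and writes its metadata next to it:

import cluster_pack

# Pack the active virtualenv (via pex) or conda env (via conda-pack),
# upload the archive to distributed storage (HDFS/S3) and store
# metadata describing the packaged environment alongside it.
package_path, _ = cluster_pack.upload_env()
print("archive available at:", package_path)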

Basic examples with skein

  1. Interactive mode

  2. Self-shipping project

Basic examples with PySpark

  1. PySpark with HDFS on YARN

  2. Docker with PySpark on S3

cluster-pack's Issues

Adopt semantic versioning for releases

Hi,
I think it would be nice to use semantic versioning for release versions: https://semver.org/
It means selecting version numbers as MAJOR.MINOR.PATCH:

  • the major version is incremented when a breaking change is made,

  • the minor version when a feature is added,

  • the patch version when a fix is done.

This is very useful for dependents to know how many changes were introduced in a new version.
It works best when the major version is >= 1, so I advise always starting at 1.0.0.
A breaking change is worth advertising to dependents, and increasing the major version definitely makes sense in such cases; being at major version 5 is fine if a few refactorings changed the API a lot.

I believe this versioning scheme is better than the "versioning as marketing" adopted by some software, which consists in saying "our new major version is a big milestone, look at all the new features" and which is not very useful for dependents.

In the case of cluster-pack it would mean releasing a 1.0.0 and then following the scheme for each release.

What do you think?

Inconsistent kernel environment uploaded in jupyter notebook

Hi,

We use cluster-pack in jupyter notebooks with conda environments.
An issue we've found is that when the kernel is in a different conda environment than the jupyter notebook server, the uploaded zip (env_name.tar.gz) contains the environment of the jupyter notebook server, while the description file env_name.tar.json lists the correct kernel environment.

A concrete example (a diagnostic sketch follows the list):

  1. jupyter notebook is installed and launched from the jupyter conda env
  2. Another conda env tf is created and its kernel is installed
  3. Launch a notebook using the tf kernel
  4. Do package_path, _ = cluster_pack.upload_env()
  5. jupyter.tar.json and jupyter.tar.gz are uploaded
  6. jupyter.tar.json correctly lists the libs in the tf conda env (not jupyter)
  7. But jupyter.tar.gz actually packages the jupyter conda env
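(Editorial note: a quick, hypothetical diagnostic for this situation is to compare the kernel's interpreter with the conda env the notebook server was started from, before calling upload_env(); nothing below is cluster-pack API, only the standard library:)

import os
import sys

# The interpreter actually running this kernel: this is the env that
# should end up in the uploaded archive.
print("kernel interpreter:", sys.executable)
print("kernel prefix:     ", sys.prefix)

# The conda env the Jupyter server was started from; if it differs
# from sys.prefix, the mismatch described above can occur.
print("CONDA_PREFIX:      ", os.environ.get("CONDA_PREFIX"))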

Any option to include the Python binary in a pex bundle

Hi,

I have a use case where the Python interpreter will not be available on all the nodes. Is it possible to bundle the Python binary along with the pex and execute it in an environment where no Python interpreter is present? I know that for those cases we may need to use freezers.

Any suggestions?

Thanks,

Using the pex binary without installing it system-wide

Hi,

1. Is it possible to use the pex command from some binary/package and create a pex file, without installing pex or creating any virtual env?
2. How can we include static files, present at some location or in Artifactory, in the pex executable?
3. How do we handle optional packages installed outside of the virtual env, e.g. the NLTK library?

Do we need to set PYTHONPATH to spark/python, or do we package pyspark in the .pex file?

Hi, I am not able to run the PySpark code until I bundle pyspark in the .pex file.

Though in normal scenarios we set PYTHONPATH as below:
PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
PYTHONPATH=$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH
export PYTHONPATH

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.pyspark.driver.python=./my_application_spark.pex \
  --conf spark.pyspark.python=./my_application_spark.pex \
  --conf spark.executorEnv.PEX_ROOT=./tmp \
  --conf spark.yarn.appMasterEnv.PEX_ROOT=./tmp \
  --files my_application_spark.pex \
  pyspark_pandas.py

It's not able to find pyspark:

"ModuleNotFoundError: No module named 'pyspark'"

Can anyone please help here?

-Thanks
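(Editorial note: a likely cause is that a pex isolates itself from the surrounding PYTHONPATH, so the $SPARK_HOME entries above are not visible inside it; bundling pyspark into the archive is the usual fix. A minimal sketch using cluster_pack.upload_env(), assuming pyspark has first been pip-installed into the active virtual environment:)

import cluster_pack

# upload_env() packages the *current* virtual environment, so having
# pyspark installed in it (pip install pyspark) is enough for the
# module to be importable inside the shipped archive.
package_path, _ = cluster_pack.upload_env()
print("archive uploaded to:", package_path)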

Any possibility to package .egg files in a pex bundle

I have a requirement where .egg files are provided for all Python libs, due to security reasons, but it looks like pex > 2.0 doesn't support picking up .egg files while bundling.

Is there any option to bundle .egg files from a local directory?
