
parallel-tutorial's Introduction

Parallel Python: Analyzing Large Datasets

Join the chat at https://gitter.im/pydata/parallel-tutorial

Student Goals

Students will walk away with a high-level understanding of parallel problems and of how to reason about parallel computing frameworks, along with hands-on experience using a variety of frameworks easily accessible from Python.

Student Level

Knowledge of Python and general familiarity with the Jupyter notebook are assumed. The tutorial is aimed at a beginner-to-intermediate audience.

Outline

In the first half we cover basic ideas and common patterns in parallel computing, including embarrassingly parallel map, unstructured asynchronous submit, and large collections; a short sketch of the first two patterns follows the outline below.

In the second half we cover complications arising from distributed-memory computing and exercise the lessons of the first half by running informative examples on provided clusters.

  • Part one
    • Parallel Map
    • Asynchronous Futures
    • High Level Datasets
  • Part two
    • Scaling cross validation parameter search
    • Tabular data with map/submit
    • Tabular data with dataframes
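
The following minimal sketch illustrates the parallel-map and asynchronous-submit patterns using only the standard library's concurrent.futures module; it is illustrative and is not taken from the tutorial notebooks.

# Minimal sketch of the "parallel map" and "asynchronous submit" patterns.
from concurrent.futures import ProcessPoolExecutor, as_completed

def slow_square(x):
    return x ** 2

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        # Embarrassingly parallel map: apply one function to many inputs.
        squares = list(executor.map(slow_square, range(10)))

        # Unstructured asynchronous submit: launch tasks individually
        # and consume results as they finish, in completion order.
        futures = [executor.submit(slow_square, i) for i in range(10)]
        finished = [f.result() for f in as_completed(futures)]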

Installation

  1. Download this repository:

    git clone https://github.com/pydata/parallel-tutorial
    

    or download as a zip file.

  2. Install Anaconda (large) or Miniconda (small)

  3. Create a new conda environment:

     conda env create -f environment.yml
     source activate parallel  # Linux, OS X
     activate parallel         # Windows
    
  4. If you want to use Spark (this is a large download):

     conda install -c conda-forge pyspark
    

Test your installation:

python -c "import concurrent.futures, dask, jupyter"

Dataset Preparation

We will generate a dataset for use locally. This will take up about 1GB of space in a new local directory, data/.

python prep.py

Part 1: Local Notebooks

Part one of this tutorial takes place on your laptop, using multiple cores. Run Jupyter Notebook locally and navigate to the notebooks/ directory.

jupyter notebook

The notebooks are ordered 1, 2, 3, so you can start with 01-map.ipynb.

Part 2: Remote Clusters

Part two of this tutorial takes place on a remote cluster.

Visit the following page to start an eight-node cluster: https://pydata-parallel.jovyan.org/

If at any point your cluster fails, you can always start a new one by revisiting this page.

Warning: your cluster will be deleted when you close out. If you want to save your work, you will need to download your notebooks explicitly.

Slides

Brief, high level slides exist at http://pydata.github.io/parallel-tutorial/.

Sponsored Cloud Provider

We thank Google for generously providing compute credits on Google Compute Engine.

parallel-tutorial's People

Contributors

ahmadia, gitter-badger, minrk, mrocklin, quasiben, rgbkrk

parallel-tutorial's Issues

No such file or directory: 'snakeviz'

Hi, when I run the script in 01-map.ipynb, I always get this error message.

I have installed snakeviz 0.4.1; any clues? Thanks.

FileNotFoundError: [Errno 2] No such file or directory: 'snakeviz'

%%snakeviz

for fn in filenames:
    print(fn)
    with open(fn) as f:
        data = [json.loads(line) for line in f]
        
    df = pd.DataFrame(data)
    
    out_filename = fn[:-5] + '.h5'
    df.to_hdf(out_filename, '/data')
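
A likely cause (an assumption, not confirmed in the issue) is that the snakeviz executable is installed into a different environment than the one running Jupyter; installing it into the active tutorial environment usually resolves this kind of FileNotFoundError:

conda install -c conda-forge snakeviz   # into the same environment that runs Jupyter
# or, inside that environment: pip install snakeviz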

README environment set-up, separate environment?

Given all the tutorials students are running, it may be better to advise them to create a separate environment for this tutorial rather than using root. For example, consider including the additional steps below in the README, instead of assuming they should install all the packages for this tutorial into root:

conda create -n parallel
source activate parallel
conda install jupyter pytables matplotlib scikit-learn

having issues with initial test

I am trying to make sure I have everything up and running before the Strata Conference next week!

Running Windows 10 Education, version 1703, 64-bit
Using Git Bash

Followed the install instructions for the parallel-tutorial
When I attempt to test the installation, I get an error.

Operator error?
[screenshot attached: initial_test]

graphviz

Add graphviz to env file and cluster env
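
For a local environment, that would correspond to something like the following conda install (a sketch; the exact packages and pins in environment.yml may differ):

conda install -c conda-forge graphviz python-graphviz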

Prep.py throws an AssertionError

Hi,

I just wanted to try the tutorial after I found this video on YouTube:
https://www.youtube.com/watch?v=5Md_sSsN51k

I use Anaconda on Windows 10, installed all necessary packages, and after trying to execute the first line in the notebook I get the following error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
C:\Users\Name\Documents\Python Scripts\parallel-tutorial-master\parallel-tutorial-master\prep.py in <module>()
     39 
     40 for symbol in stocks:
---> 41     write_stock(symbol)
     42 
     43 

C:\Users\Name\Documents\Python Scripts\parallel-tutorial-master\parallel-tutorial-master\prep.py in write_stock(symbol)
     35         names = [str(ts.date()) for ts in df.divisions]
     36         df.to_csv(os.path.join(here, 'data', 'minute', symbol, '*.csv'),
---> 37                   name_function=names.__getitem__)
     38         print("Finished CSV: %s" % symbol)
     39 

C:\Users\Name\Anaconda3\lib\site-packages\dask\dataframe\core.py in to_csv(self, filename, **kwargs)
    957         """ See dd.to_csv docstring for more information """
    958         from .io import to_csv
--> 959         return to_csv(self, filename, **kwargs)
    960 
    961     def to_delayed(self):

C:\Users\Name\Anaconda3\lib\site-packages\dask\dataframe\io\csv.py in to_csv(df, filename, name_function, compression, compute, get, storage_options, **kwargs)
    503 
    504     if compute:
--> 505         delayed(values).compute(get=get)
    506     else:
    507         return values

C:\Users\Name\Anaconda3\lib\site-packages\dask\base.py in compute(self, **kwargs)
     97             Extra keywords to forward to the scheduler ``get`` function.
     98         """
---> 99         (result,) = compute(self, traverse=False, **kwargs)
    100         return result
    101 

C:\Users\Name\Anaconda3\lib\site-packages\dask\base.py in compute(*args, **kwargs)
    204     dsk = collections_to_dsk(variables, optimize_graph, **kwargs)
    205     keys = [var._keys() for var in variables]
--> 206     results = get(dsk, keys, **kwargs)
    207 
    208     results_iter = iter(results)

C:\Users\Name\Anaconda3\lib\site-packages\dask\threaded.py in get(dsk, result, cache, num_workers, **kwargs)
     73     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     74                         cache=cache, get_id=_thread_get_id,
---> 75                         pack_exception=pack_exception, **kwargs)
     76 
     77     # Cleanup pools associated to dead threads

C:\Users\Name\Anaconda3\lib\site-packages\dask\local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    519                         _execute_task(task, data)  # Re-execute locally
    520                     else:
--> 521                         raise_exception(exc, tb)
    522                 res, worker_id = loads(res_info)
    523                 state['cache'][key] = res

C:\Users\Name\Anaconda3\lib\site-packages\dask\compatibility.py in reraise(exc, tb)
     58         if exc.__traceback__ is not tb:
     59             raise exc.with_traceback(tb)
---> 60         raise exc
     61 
     62 else:

C:\Users\Name\Anaconda3\lib\site-packages\dask\local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    288     try:
    289         task, data = loads(task_info)
--> 290         result = _execute_task(task, data)
    291         id = get_id()
    292         result = dumps((result, id))

C:\Users\Name\Anaconda3\lib\site-packages\dask\local.py in _execute_task(arg, cache, dsk)
    269         func, args = arg[0], arg[1:]
    270         args2 = [_execute_task(a, cache) for a in args]
--> 271         return func(*args2)
    272     elif not ishashable(arg):
    273         return arg

C:\Users\Name\Anaconda3\lib\site-packages\dask\compatibility.py in apply(func, args, kwargs)
     45     def apply(func, args, kwargs=None):
     46         if kwargs:
---> 47             return func(*args, **kwargs)
     48         else:
     49             return func(*args)

C:\Users\Name\Anaconda3\lib\site-packages\dask\dataframe\io\demo.py in generate_day(date, open, high, low, close, volume, freq, random_state)
    114         values += np.linspace(open - values[0], close - values[-1],
    115                               len(values))  # endpoints
--> 116         assert np.allclose(open, values[0])
    117         assert np.allclose(close, values[-1])
    118 

AssertionError: 

I also tried it under macOS Sierra with Miniconda and an extra environment, as well as under Ubuntu 17.04 with Miniconda. And now I am out of operating systems 😄

Error when running prep.py - ImportError: No module named '_pandasujson'

Traceback (most recent call last):
  File "prep.py", line 64, in <module>
    dask.compute(values)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/base.py", line 204, in compute
    results = get(dsk, keys, **kwargs)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/multiprocessing.py", line 177, in get
    raise_exception=reraise, **kwargs)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/compatibility.py", line 59, in reraise
    raise exc.with_traceback(tb)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/local.py", line 289, in execute_task
    task, data = loads(task_info)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/cloudpickle/cloudpickle.py", line 840, in subimport
    __import__(name)
ImportError: No module named '_pandasujson'

Notebook 03a, SparkContext() will not run without modifying `/etc/hosts` on OSX.

Students in your tutorial may encounter this. The following code cell generates an error without updating /etc/hosts:

sc = SparkContext('local[4]')

The error is as follows:

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.net.BindException: Can't assign requested address: Service 'sparkDriver' failed after 16 retries!

Adding the following to the end of /etc/hosts enabled me to run the cell successfully:

127.0.0.1 <hostname>

Where <hostname> is the output from calling hostname from the shell.
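
A one-line way to append this (an illustrative shell command; editing /etc/hosts requires administrator privileges and affects the whole system) is:

echo "127.0.0.1 $(hostname)" | sudo tee -a /etc/hosts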

google finance dataset deprecated

Hi there,
it seems that the Google Finance API is deprecated and, as such, the "prep.py" file is not working anymore!
Can someone describe what has to be changed in prep.py or the associated files so we can follow along with this amazing and very informative tutorial?

cheers,
PS: cloned this repository on a Mac
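
One possible workaround (a sketch, not an official fix from the maintainers) is to generate synthetic time-series data locally with dask's built-in demo dataset, available in recent dask versions, instead of downloading from the deprecated API:

# Sketch: generate a synthetic time series locally instead of
# downloading data from the deprecated Google Finance API.
import os
import dask.datasets

os.makedirs(os.path.join('data', 'synthetic'), exist_ok=True)
df = dask.datasets.timeseries()                        # random data, one partition per day
df.to_csv(os.path.join('data', 'synthetic', '*.csv'))  # one CSV file per partition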

Spark S3 access

I suspect that this is because we have yet to include credentials. Still, reporting it here just in case.

rdd = sc.textFile("s3a://githubarchive-data/2015-01-01-*.json.gz")
rdd.take(2)

Py4JJavaError: An error occurred while calling o25.partitions.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 2EEACD7FB2D731F3, AWS Error Code: InvalidAccessKeyId, AWS Error Message: The AWS Access Key Id you provided does not exist in our records., S3 Extended Request ID: GZakDXNA+ZbAdi7xp24fiIxWJEZIatmAi4P2sCg+w5+XbidUGy88hcGZ/ZwNDq/8GZGwycJUS6g=

prep.py Error

It seems that Google has updated their API, so running prep.py raises a remote error:
raise RemoteDataError('Unable to read URL: {0}'.format(url))
pandas_datareader._utils.RemoteDataError: Unable to read URL: http://www.google.com/finance/historical?q=usb&startdate=Jan+27%2C+2017&enddate=Jan+27%2C+2018&output=csv
Is there a way the offline versions of JSON files could be made available?

Spark package

Our two choices for pyspark on anaconda.org are either the conda-forge or quasiben channel.

It looks like conda-forge is Python 2.7 only?

(parallel) mrocklin@carbon:~$ conda install -c conda-forge pyspark
Fetching package metadata ...........
Solving package specifications: .


UnsatisfiableError: The following specifications were found to be in conflict:
  - pyspark -> python 2.7*
  - python 3.6*
Use "conda info <package>" to see the dependencies for each package.

Meanwhile, the quasiben package lacks support for Python 3.6:

(parallel) mrocklin@carbon:~$ conda install -c quasiben spark
Fetching package metadata ...........
Solving package specifications: .


UnsatisfiableError: The following specifications were found to be in conflict:
  - python 3.6*
  - spark -> py4j ==0.10.1 -> python 3.5* -> openssl 1.0.1*
  - spark -> py4j ==0.10.1 -> python 3.5* -> xz 5.0.5
Use "conda info <package>" to see the dependencies for each package.

Do we use quasiben and force 3.5? Do we ask @quasiben to update his package to 3.6? Do we ask conda-forge people to update the pyspark package?
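
Forcing Python 3.5 alongside the quasiben package (the first option above) would look roughly like this (a sketch; the environment name is illustrative):

conda create -n parallel-spark python=3.5
source activate parallel-spark
conda install -c quasiben spark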

Rename repository / org

Currently this lives at mrocklin/parallel-data-analysis.

Should we move this to a different org? Perhaps we're ready to ask PyData to use their space?

Should we rename this? parallel-tutorial perhaps?

@minrk @quasiben @ahmadia
