
parallel-tutorial's Introduction

Parallel Python: Analyzing Large Datasets

Join the chat at https://gitter.im/pydata/parallel-tutorial

Student Goals

Students will walk away with a high-level understanding of parallel problems and of how to reason about parallel computing frameworks, along with hands-on experience using a variety of frameworks easily accessible from Python.

Student Level

Knowledge of Python and general familiarity with the Jupyter notebook are assumed. The tutorial is aimed at a beginner-to-intermediate audience.

Outline

In the first half we cover basic ideas and common patterns in parallel computing, including embarrassingly parallel map, unstructured asynchronous submit, and large collections; a short sketch of the first two patterns follows the outline below.

In the second half we cover complications arising from distributed-memory computing and exercise the lessons of the first half by running informative examples on provided clusters.

  • Part one
    • Parallel Map
    • Asynchronous Futures
    • High Level Datasets
  • Part two
    • Scaling cross validation parameter search
    • Tabular data with map/submit
    • Tabular data with dataframes
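
The following minimal sketch illustrates the parallel-map and asynchronous-submit patterns using only the standard library's concurrent.futures module; it is illustrative and is not taken from the tutorial notebooks.

# Minimal sketch of the "parallel map" and "asynchronous submit" patterns.
from concurrent.futures import ProcessPoolExecutor, as_completed

def slow_square(x):
    return x ** 2

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        # Embarrassingly parallel map: apply one function to many inputs.
        squares = list(executor.map(slow_square, range(10)))

        # Unstructured asynchronous submit: launch tasks individually
        # and consume results as they finish, in completion order.
        futures = [executor.submit(slow_square, i) for i in range(10)]
        finished = [f.result() for f in as_completed(futures)]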

Installation

  1. Download this repository:

    git clone https://github.com/pydata/parallel-tutorial
    

    or download as a zip file.

  2. Install Anaconda (large) or Miniconda (small)

  3. Create a new conda environment:

     conda env create -f environment.yml
     source activate parallel  # Linux, OS X
     activate parallel         # Windows
    
  4. If you want to use Spark (this is a large download):

     conda install -c conda-forge pyspark
    

Test your installation:

python -c "import concurrent.futures, dask, jupyter"

Dataset Preparation

We will generate a dataset for use locally. This will take up about 1GB of space in a new local directory, data/.

python prep.py

Part 1: Local Notebooks

Part one of this tutorial takes place on your laptop, using multiple cores. Run Jupyter Notebook locally and navigate to the notebooks/ directory.

jupyter notebook

The notebooks are ordered 1, 2, 3, so you can start with 01-map.ipynb.

Part 2: Remote Clusters

Part two of this tutorial takes place on a remote cluster.

Visit the following page to start an eight-node cluster: https://pydata-parallel.jovyan.org/

If at any point your cluster fails, you can always start a new one by revisiting this page.

Warning: your cluster will be deleted when you close out. If you want to save your work, you will need to download your notebooks explicitly.

Slides

Brief, high level slides exist at http://pydata.github.io/parallel-tutorial/.

Sponsored Cloud Provider

We thank Google for generously providing compute credits on Google Compute Engine.

parallel-tutorial's People

Contributors

ahmadia, gitter-badger, minrk, mrocklin, quasiben, rgbkrk

parallel-tutorial's Issues

No such file or directory: 'snakeviz'

Hi, when I run the script in 01-map.ipynb, I always get this error message.

I have installed snakeviz 0.4.1; any clues? Thanks.

FileNotFoundError: [Errno 2] No such file or directory: 'snakeviz'

%%snakeviz

for fn in filenames:
    print(fn)
    with open(fn) as f:
        data = [json.loads(line) for line in f]
        
    df = pd.DataFrame(data)
    
    out_filename = fn[:-5] + '.h5'
    df.to_hdf(out_filename, '/data')
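
A likely cause (an assumption, not confirmed in the issue) is that the snakeviz executable is installed into a different environment than the one running Jupyter; installing it into the active tutorial environment usually resolves this kind of FileNotFoundError:

conda install -c conda-forge snakeviz   # into the same environment that runs Jupyter
# or, inside that environment: pip install snakeviz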

README environment set-up, separate environment?

Given all the tutorials students are running, it may be better to advise them to create a separate environment for this tutorial rather than using root. For example, consider including the additional steps below in the README, instead of assuming they should install all the packages for this tutorial into root:

conda create -n parallel
source activate parallel
conda install jupyter pytables matplotlib scikit-learn

having issues with initial test

I am trying to make sure I have everything up and running before the Strata Conference next week!

Running Windows 10 Education, version 1703, 64-bit
Using Git Bash

Followed the install instructions for the parallel-tutorial
When I attempt to test the installation, I get an error.

Operator error?
[screenshot attached: initial_test]

graphviz

Add graphviz to env file and cluster env
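
For a local environment, that would correspond to something like the following conda install (a sketch; the exact packages and pins in environment.yml may differ):

conda install -c conda-forge graphviz python-graphviz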

Prep.py throws an AssertionError

Hi,

I just wanted to try the tutorial after I found this video on YouTube:
https://www.youtube.com/watch?v=5Md_sSsN51k

I use Anaconda on Windows 10, installed all necessary packages, and after trying to execute the first line in the notebook I get the following error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
C:\Users\Name\Documents\Python Scripts\parallel-tutorial-master\parallel-tutorial-master\prep.py in <module>()
     39 
     40 for symbol in stocks:
---> 41     write_stock(symbol)
     42 
     43 

C:\Users\Name\Documents\Python Scripts\parallel-tutorial-master\parallel-tutorial-master\prep.py in write_stock(symbol)
     35         names = [str(ts.date()) for ts in df.divisions]
     36         df.to_csv(os.path.join(here, 'data', 'minute', symbol, '*.csv'),
---> 37                   name_function=names.__getitem__)
     38         print("Finished CSV: %s" % symbol)
     39 

C:\Users\Name\Anaconda3\lib\site-packages\dask\dataframe\core.py in to_csv(self, filename, **kwargs)
    957         """ See dd.to_csv docstring for more information """
    958         from .io import to_csv
--> 959         return to_csv(self, filename, **kwargs)
    960 
    961     def to_delayed(self):

C:\Users\Name\Anaconda3\lib\site-packages\dask\dataframe\io\csv.py in to_csv(df, filename, name_function, compression, compute, get, storage_options, **kwargs)
    503 
    504     if compute:
--> 505         delayed(values).compute(get=get)
    506     else:
    507         return values

C:\Users\Name\Anaconda3\lib\site-packages\dask\base.py in compute(self, **kwargs)
     97             Extra keywords to forward to the scheduler ``get`` function.
     98         """
---> 99         (result,) = compute(self, traverse=False, **kwargs)
    100         return result
    101 

C:\Users\Name\Anaconda3\lib\site-packages\dask\base.py in compute(*args, **kwargs)
    204     dsk = collections_to_dsk(variables, optimize_graph, **kwargs)
    205     keys = [var._keys() for var in variables]
--> 206     results = get(dsk, keys, **kwargs)
    207 
    208     results_iter = iter(results)

C:\Users\Name\Anaconda3\lib\site-packages\dask\threaded.py in get(dsk, result, cache, num_workers, **kwargs)
     73     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     74                         cache=cache, get_id=_thread_get_id,
---> 75                         pack_exception=pack_exception, **kwargs)
     76 
     77     # Cleanup pools associated to dead threads

C:\Users\Name\Anaconda3\lib\site-packages\dask\local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    519                         _execute_task(task, data)  # Re-execute locally
    520                     else:
--> 521                         raise_exception(exc, tb)
    522                 res, worker_id = loads(res_info)
    523                 state['cache'][key] = res

C:\Users\Name\Anaconda3\lib\site-packages\dask\compatibility.py in reraise(exc, tb)
     58         if exc.__traceback__ is not tb:
     59             raise exc.with_traceback(tb)
---> 60         raise exc
     61 
     62 else:

C:\Users\Name\Anaconda3\lib\site-packages\dask\local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    288     try:
    289         task, data = loads(task_info)
--> 290         result = _execute_task(task, data)
    291         id = get_id()
    292         result = dumps((result, id))

C:\Users\Name\Anaconda3\lib\site-packages\dask\local.py in _execute_task(arg, cache, dsk)
    269         func, args = arg[0], arg[1:]
    270         args2 = [_execute_task(a, cache) for a in args]
--> 271         return func(*args2)
    272     elif not ishashable(arg):
    273         return arg

C:\Users\Name\Anaconda3\lib\site-packages\dask\compatibility.py in apply(func, args, kwargs)
     45     def apply(func, args, kwargs=None):
     46         if kwargs:
---> 47             return func(*args, **kwargs)
     48         else:
     49             return func(*args)

C:\Users\Name\Anaconda3\lib\site-packages\dask\dataframe\io\demo.py in generate_day(date, open, high, low, close, volume, freq, random_state)
    114         values += np.linspace(open - values[0], close - values[-1],
    115                               len(values))  # endpoints
--> 116         assert np.allclose(open, values[0])
    117         assert np.allclose(close, values[-1])
    118 

AssertionError: 

I also tried it under macOS Sierra with Miniconda and an extra environment, as well as under Ubuntu 17.04 with Miniconda. And now I am out of operating systems 😄

Error when running prep.py - ImportError: No module named '_pandasujson'

Traceback (most recent call last):
  File "prep.py", line 64, in <module>
    dask.compute(values)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/base.py", line 204, in compute
    results = get(dsk, keys, **kwargs)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/multiprocessing.py", line 177, in get
    raise_exception=reraise, **kwargs)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/compatibility.py", line 59, in reraise
    raise exc.with_traceback(tb)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/dask/local.py", line 289, in execute_task
    task, data = loads(task_info)
  File "/Users/aron/anaconda3/envs/parallel/lib/python3.5/site-packages/cloudpickle/cloudpickle.py", line 840, in subimport
    __import__(name)
ImportError: No module named '_pandasujson'

Notebook 03a, SparkContext() will not run without modifying `/etc/hosts` on OSX.

Students in your tutorial may encounter this. The following code cell generates an error without updating /etc/hosts:

sc = SparkContext('local[4]')

The error is as follows:

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.net.BindException: Can't assign requested address: Service 'sparkDriver' failed after 16 retries!

Adding the following to the end of /etc/hosts enabled me to run the cell successfully:

127.0.0.1 <hostname>

Where <hostname> is the output from calling hostname from the shell.
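
A one-line way to append this (an illustrative shell command; editing /etc/hosts requires administrator privileges and affects the whole system) is:

echo "127.0.0.1 $(hostname)" | sudo tee -a /etc/hosts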

google finance dataset deprecated

Hi there,
it seems that the Google Finance API is deprecated and, as such, the "prep.py" file is not working anymore!
Can someone describe what has to be changed in prep.py or the associated files so we can follow along with this amazing and very informative tutorial?

cheers,
PS: cloned this repository on a Mac
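
One possible workaround (a sketch, not an official fix from the maintainers) is to generate synthetic time-series data locally with dask's built-in demo dataset, available in recent dask versions, instead of downloading from the deprecated API:

# Sketch: generate a synthetic time series locally instead of
# downloading data from the deprecated Google Finance API.
import os
import dask.datasets

os.makedirs(os.path.join('data', 'synthetic'), exist_ok=True)
df = dask.datasets.timeseries()                        # random data, one partition per day
df.to_csv(os.path.join('data', 'synthetic', '*.csv'))  # one CSV file per partition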

Spark S3 access

I suspect that this is because we have yet to include credentials. Still, reporting it here just in case.

rdd = sc.textFile("s3a://githubarchive-data/2015-01-01-*.json.gz")
rdd.take(2)

Py4JJavaError: An error occurred while calling o25.partitions.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 2EEACD7FB2D731F3, AWS Error Code: InvalidAccessKeyId, AWS Error Message: The AWS Access Key Id you provided does not exist in our records., S3 Extended Request ID: GZakDXNA+ZbAdi7xp24fiIxWJEZIatmAi4P2sCg+w5+XbidUGy88hcGZ/ZwNDq/8GZGwycJUS6g=

prep.py Error

It seems that Google has updated their API, so running prep.py raises a remote error:
raise RemoteDataError('Unable to read URL: {0}'.format(url))
pandas_datareader._utils.RemoteDataError: Unable to read URL: http://www.google.com/finance/historical?q=usb&startdate=Jan+27%2C+2017&enddate=Jan+27%2C+2018&output=csv
Is there a way the offline versions of JSON files could be made available?

Spark package

Our two choices for pyspark on anaconda.org are either the conda-forge or quasiben channel.

It looks like conda-forge is Python 2.7 only?

(parallel) mrocklin@carbon:~$ conda install -c conda-forge pyspark
Fetching package metadata ...........
Solving package specifications: .


UnsatisfiableError: The following specifications were found to be in conflict:
  - pyspark -> python 2.7*
  - python 3.6*
Use "conda info <package>" to see the dependencies for each package.

Meanwhile, the quasiben package lacks support for Python 3.6:

(parallel) mrocklin@carbon:~$ conda install -c quasiben spark
Fetching package metadata ...........
Solving package specifications: .


UnsatisfiableError: The following specifications were found to be in conflict:
  - python 3.6*
  - spark -> py4j ==0.10.1 -> python 3.5* -> openssl 1.0.1*
  - spark -> py4j ==0.10.1 -> python 3.5* -> xz 5.0.5
Use "conda info <package>" to see the dependencies for each package.

Do we use quasiben and force 3.5? Do we ask @quasiben to update his package to 3.6? Do we ask conda-forge people to update the pyspark package?
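
Forcing Python 3.5 alongside the quasiben package (the first option above) would look roughly like this (a sketch; the environment name is illustrative):

conda create -n parallel-spark python=3.5
source activate parallel-spark
conda install -c quasiben spark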

Rename repository / org

Currently this lives at mrocklin/parallel-data-analysis.

Should we move this to a different org? Perhaps we're ready to ask PyData to use their space?

Should we rename this? parallel-tutorial perhaps?

@minrk @quasiben @ahmadia
