logicalclocks / maggy

Distribution transparent Machine Learning experiments on Apache Spark

Home Page: https://maggy.ai

License: Apache License 2.0

Python 100.00%
hyperparameter-optimization hyperparameter-search automl ablation spark hyperparameter-tuning blackbox-optimization ablation-studies ablation-study

maggy's Introduction

Maggy


Maggy is a framework for distribution transparent machine learning experiments on Apache Spark. It provides a unified way of writing core ML training logic as oblivious training functions: you reuse the same training code whether you are training small models on your laptop, scaling out hyperparameter tuning, or running distributed deep learning on a cluster. Maggy thereby replaces the current waterfall development process for distributed ML applications, in which code is rewritten at every stage to account for a different distribution context.

Maggy uses the same distribution transparent training function in all steps of the machine learning development process.

Quick Start

Maggy uses PySpark as the engine to distribute the training processes. To get started, install Maggy in the Python environment used by your Spark cluster, or install Maggy in your local Python environment with the 'spark' extra to run on Spark in local mode:

pip install maggy
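
To run on Spark in local mode with the 'spark' extra mentioned above (assuming the extra is named exactly spark, as the text suggests):

pip install maggy[spark]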

The programming model consists of wrapping your model training code inside a function. Inside that wrapper function, provide all imports and other parts that make up your experiment.

Single run experiment:

def train_fn():
    # This is your training iteration loop
    for i in range(number_iterations):
        ...
        # add the maggy reporter to report the metric to be optimized
        reporter.broadcast(metric=accuracy)
        ...
    # Return metric to be optimized or any metric to be logged
    return accuracy

from maggy import experiment
result = experiment.lagom(train_fn=train_fn, name='MNIST')

lagom is a Swedish word meaning "just the right amount". This is how Maggy uses your resources.

Documentation

Full documentation is available at maggy.ai

Contributing

There are various ways to contribute, and any contribution is welcome. Please follow the CONTRIBUTING guide to get started.

Issues

Issues can be reported on the official GitHub repo of Maggy.

Citation

Please see our publications on maggy.ai to find out how to cite our work.

Acknowledgements

The development of Maggy is supported by the EU H2020 Deep Cube Project (Grant agreement ID: 101004188).

maggy's People

Contributors

amacati, dependabot[bot], moritzmeister, o-alex, riccardogrigoletto, robzor92, ssheikholeslami, tabularaza27


maggy's Issues

ModuleNotFoundError: No module named 'maggy.experiment_config'

I can install Maggy on my PySpark cluster from pip, but whenever I issue the command from maggy.experiment_config import OptimizationConfig I get the error ModuleNotFoundError: No module named 'maggy.experiment_config'. Any idea what could be happening? I am using JupyterLab with a Python 3 kernel.

from maggy import experiment says no module named hops.

The imports that I have seen working are from maggy.ablation import AblationStudy and from maggy import Searchspace

Edit:
I noticed that I cannot use the pre-release version (if it contains a fix for this, that is). I get this error when I try to install the pre-release version:
ERROR: Could not find a version that satisfies the requirement maggy==1.0.0rc0 (from versions: 0.0.1, 0.1, 0.1.1, 0.2, 0.2.1, 0.2.2, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.4.0, 0.4.1, 0.4.2, 0.5.0, 0.5.1, 0.5.2, 0.5.3)
ERROR: No matching distribution found for maggy==1.0.0rc0

Create Environment classes with common interface

This issue can be split into multiple PRs (a rough sketch of the intended class hierarchy follows below):

PR 1
  • Base Environment class
  • Hopsworks Environment class
  • Integration
  (up to here nothing should change)

PR 2
  • Databricks Environment class

PR 3
  • Local Environment class
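
A minimal sketch of what such a common interface could look like; all class and method names below are hypothetical illustrations, not Maggy's actual code:

from abc import ABC, abstractmethod

class BaseEnvironment(ABC):
    """Common interface that Maggy's orchestration code would program against."""

    @abstractmethod
    def get_driver_ip(self) -> str:
        """Return the address workers should connect to."""

    @abstractmethod
    def get_executor_count(self) -> int:
        """Return the number of executors available for trials."""

class HopsworksEnvironment(BaseEnvironment):
    def get_driver_ip(self) -> str:
        ...  # existing Hopsworks-specific logic would move here

    def get_executor_count(self) -> int:
        ...

class DatabricksEnvironment(BaseEnvironment):
    def get_driver_ip(self) -> str:
        ...  # Databricks-specific workaround (see the IP-inference issue below)

    def get_executor_count(self) -> int:
        ...

class LocalEnvironment(BaseEnvironment):
    def get_driver_ip(self) -> str:
        return "127.0.0.1"  # local mode: everything runs on one host

    def get_executor_count(self) -> int:
        return 1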

Distributed training using tensorflow

Use Maggy for distributed training given a TensorFlow model.
The solution will work similarly to the Torch distributed training that is already implemented; an illustrative sketch follows below.
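
As an illustration only: the sketch below is plain TensorFlow multi-worker training (tf.distribute.MultiWorkerMirroredStrategy), not Maggy's planned API. Maggy would have to take care of worker discovery and TF_CONFIG on the Spark executors, similarly to the existing Torch backend.

import tensorflow as tf

def train_fn():
    # Standard TensorFlow multi-worker data parallelism.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
            tf.keras.layers.Dense(10, activation='softmax'),
        ])
        model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])

    # Toy data for the sketch; a real setup would shard the dataset per worker.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype('float32') / 255.0

    history = model.fit(x_train, y_train, epochs=1, batch_size=128)
    return history.history['accuracy'][-1]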

Infer IP address of Driver/Worker

Currently, on Hopsworks we bind a socket to a random port on the different hosts to get the IP address and send it to the driver.

On Databricks this doesn't work, because Databricks sets a 127.0.1.1 host for its machines. Possible workarounds (a sketch of the first two follows below):

  • Get the driver IP from the Spark config, then let the workers connect and take their IP addresses from the connection.
  • Ping Google DNS and hope the workers only have one network interface.
  • Do the workers also have a config property with their IP?

For an arbitrary Spark cluster: which is the most reliable method here?
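
A minimal sketch of the first two workarounds, assuming a live SparkSession; this is illustrative code, not what Maggy currently does:

import socket
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Workaround 1: the driver host is available in the Spark configuration.
driver_ip = spark.sparkContext.getConf().get("spark.driver.host")

# Workaround 2 (run on a worker): open a UDP socket towards a public address
# and read back the local interface the OS picked. No packet is actually sent.
def local_ip():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    finally:
        s.close()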

Infer Number of Available Executors on Different Environments

If we want to support Maggy on arbitrary Spark clusters, we need a reliable way of inferring the number of executors that are available on the Spark cluster.

Additionally, as a fallback, the user should be able to specify a lower number of executors in case it's a shared cluster.

We have to look at the following properties (a sketch follows below):

  • spark.dynamicAllocation.maxExecutors (dynamic allocation)
  • spark.executor.instances (static allocation)

Find equivalents on Databricks.
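
A minimal sketch of how those two properties could be read, with an optional user-supplied cap; the property names come from this issue, the rest is illustrative:

from pyspark.sql import SparkSession

def available_executors(user_max=None):
    conf = SparkSession.builder.getOrCreate().sparkContext.getConf()

    if conf.get("spark.dynamicAllocation.enabled", "false") == "true":
        # Dynamic allocation: use the configured upper bound, if any.
        num = int(conf.get("spark.dynamicAllocation.maxExecutors", "1"))
    else:
        # Static allocation.
        num = int(conf.get("spark.executor.instances", "1"))

    # Fallback: let the user request fewer executors on a shared cluster.
    if user_max is not None:
        num = min(num, user_max)
    return num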

Replace Black Box term with Opaque or similar neutral terms

The terms black box / white box are seen as borderline racist by many, due to the common perception of black being used for something negative and white for something good. The racist implications of such terms are not always clear, and there is rarely consensus, but it may be better to avoid them and replace them with more neutral language. It is therefore requested to use more neutral terminology.
Let's replace:

  • blacklist / whitelist -> blocklist / allowlist
  • master / slave -> leader / worker
  • white/black box -> clear/opaque or closed/glass box
  • whitespace -> emptyspace

The term black box appears:

  • line 54 of setup.py
  • line 7 of README.rst
  • line 64/66 of maggy/experiment.py
  • line 33 of maggy/optimizer/bayes/base.py

Can Maggy be used with a Spark cluster that uses YARN?

I was wondering how Maggy knows the following.
Question 1: How does Maggy contact Spark's driver to make the RPC calls in a Spark infrastructure other than Hopsworks? Does having a resource manager such as YARN on top of the Spark cluster affect how Maggy should make RPC requests to the Spark driver, or should it work as normal?

Question 2: If I opt to "deploy an entire Hopsworks instance to your own AWS account", as explained here (https://hopsworks.readthedocs.io/en/stable/getting_started/installation_guide/platforms/aws-image.html), a t2.2xlarge instance that has 8 vCPUs and 32 GB RAM is a single host and not a cluster. Do the 8 available vCPUs equate to the executors that Spark will use? Meaning that if I run my trials and want to compare results from different executors, I will only have a maximum of 8 available executors, unless I increase the instance type?

The reason behind Question 1 is that in this example (https://github.com/logicalclocks/maggy/blob/master/examples/maggy-ablation-titanic-example.ipynb), a Spark session is created, but I cannot explicitly see the point where Maggy hands over/submits jobs to Spark. For instance, in some industrial setups one would interact with Spark by creating a Spark session and then submitting the job like so:

Creating the Spark session:

spark = SparkSession \
    .builder \
    .appName('spark-ipython') \
    .config('spark.shuffle.service.enabled', 'true') \
    .config('spark.executor.memory', '2844M') \
    .config('spark.dynamicAllocation.enabled', 'true') \
    .config('spark.dynamicAllocation.minExecutors', '0') \
    .config('spark.dynamicAllocation.maxExecutors', '100') \
    .getOrCreate()

Submitting the job to the Spark cluster:

spark-submit --master yarn --deploy-mode cluster \
    --archives hdfs:///somelocation/Python.zip#Python \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./Python/bin/python3 \
    --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./Python/bin/python3 \
    main.py

whereby main.py contains the Maggy code.

Example of main.py file

from maggy import Searchspace

# The searchspace can be instantiated with parameters
sp = Searchspace(kernel=('INTEGER', [2, 8]), pool=('INTEGER', [2, 8]))

# Or additional parameters can be added one by one
sp.add('dropout', ('DOUBLE', [0.01, 0.99]))

from maggy import experiment
from maggy.callbacks import KerasBatchEnd

#########
### maggy: hyperparameters as arguments and including the reporter
#########
def keras(kernel, pool, dropout, reporter):
    from tensorflow.python import keras
    import tensorflow as tf
    from tensorflow.python.keras.datasets import mnist
    from tensorflow.python.keras.models import Sequential
    from tensorflow.python.keras.layers import Dense, Dropout, Flatten
    from tensorflow.python.keras.layers import Conv2D, MaxPooling2D
    from tensorflow.python.keras.callbacks import TensorBoard
    from tensorflow.python.keras import backend as K
    import math

    batch_size = 512
    num_classes = 10
    epochs = 1

    # Input image dimensions
    img_rows, img_cols = 28, 28

    # The data, shuffled and split between train and test sets
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    if K.image_data_format() == 'channels_first':
        x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
        x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
        input_shape = (1, img_rows, img_cols)
    else:
        x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
        x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
        input_shape = (img_rows, img_cols, 1)

    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255
    print('x_train shape:', x_train.shape)
    print(x_train.shape[0], 'train samples')
    print(x_test.shape[0], 'test samples')

    # Convert class vectors to binary class matrices
    y_train = keras.utils.to_categorical(y_train, num_classes)
    y_test = keras.utils.to_categorical(y_test, num_classes)

    model = Sequential()
    model.add(Conv2D(32, kernel_size=(kernel, kernel),
                     activation='relu',
                     input_shape=input_shape))
    model.add(Conv2D(64, (kernel, kernel), activation='relu'))
    model.add(MaxPooling2D(pool_size=(pool, pool)))
    model.add(Dropout(dropout))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(dropout))
    model.add(Dense(num_classes, activation='softmax'))

    opt = keras.optimizers.Adadelta(1.0)

    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=opt,
                  metrics=['accuracy'])

    #########
    ### maggy: REPORTER API through keras callback
    #########
    callbacks = [KerasBatchEnd(reporter, metric='acc')]

    model.fit(x_train, y_train,
              batch_size=batch_size,
              callbacks=callbacks,  # add callback
              epochs=epochs,
              verbose=1,
              validation_data=(x_test, y_test))
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

    #########
    ### maggy: return the metric to be optimized, test accuracy in this case
    #########
    return score[1]

Is maggy applicable to my use case?

Hi, I've just found this library and it seems great, but I wanted to quickly double-check whether it's applicable to my use case. Namely, I have a large amount of tabular data stored in Spark DataFrames (so the data is distributed across multiple machines) on Databricks, and I'm using a Spark ML model. Will I be able to run trials in parallel in such a setting using Maggy?

lagom() got an unexpected keyword argument 'searchspace'


TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
----> 1 result = experiment.lagom(embeddings_computer,
      2                           searchspace=sp,
      3                           optimizer='randomsearch',
      4                           direction='max',
      5                           num_trials=2,

TypeError: lagom() got an unexpected keyword argument 'searchspace'
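
The traceback above suggests the installed Maggy version no longer accepts searchspace (and the other tuning options) as keyword arguments to lagom. Judging from the import attempted in the first issue above (maggy.experiment_config), newer releases appear to bundle these options in a config object instead; the sketch below is an assumption about that style, and the exact parameter names may differ from the real API:

from maggy import experiment
from maggy.experiment_config import OptimizationConfig

# Assumed config-object style; check the Maggy docs for the exact signature.
config = OptimizationConfig(num_trials=2,
                            optimizer='randomsearch',
                            searchspace=sp,
                            direction='max',
                            name='embeddings')

result = experiment.lagom(train_fn=embeddings_computer, config=config)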

AttributeError: module 'maggy.experiment' has no attribute 'lagom'

I pip-installed the latest version of Maggy (version 1.1.0) and ran a simple Maggy example, but it is not working.
import maggy
from maggy import experiment
...
result = experiment.lagom(train_fn=training_fn, name='MNIST')

returns AttributeError: module 'maggy.experiment' has no attribute 'lagom'

