ibm / ffdl Goto Github PK

Fabric for Deep Learning (FfDL, pronounced fiddle) is a Deep Learning Platform offering TensorFlow, Caffe, PyTorch etc. as a Service on Kubernetes

Home Page: https://developer.ibm.com/code/patterns/deploy-and-use-a-multi-framework-deep-learning-platform-on-kubernetes/

License: Apache License 2.0

Makefile 3.28% Shell 5.70% Go 50.02% TypeScript 7.42% JavaScript 0.22% HTML 1.64% CSS 0.52% Python 20.29% Jupyter Notebook 9.71% Dockerfile 1.21%

ai ml deep-learning kubernetes-cluster tensorflow caffe pytorch keras machine-learning deeplearning

ffdl's Introduction

Read this in other languages: 中文.

Fabric for Deep Learning (FfDL)

This repository contains the core services of the FfDL (Fabric for Deep Learning) platform. FfDL is an operating system "fabric" for Deep Learning. It is a collaboration platform for:

Framework-independent training of Deep Learning models on distributed hardware
Open Deep Learning APIs
Running Deep Learning hosting in user's private or public cloud

To know more about the architectural details, please read the design document. If you are looking for demos, slides, collaterals, blogs, webinars and other materials related to FfDL, please find them here

Prerequisites

kubectl: The Kubernetes command line interface (https://kubernetes.io/docs/tasks/tools/install-kubectl/)
helm: The Kubernetes package manager (https://helm.sh)
docker: The Docker command-line interface (https://www.docker.com/)
S3 CLI: The command-line interface to configure your Object Storage
An existing Kubernetes cluster (e.g., Kubeadm-DIND for local testing or Follow the appropriate instructions for standing up your Kubernetes cluster using IBM Cloud Public or IBM Cloud Private). The minimum capacity requirement for FfDL is 4GB Memory and 3 CPUs.

Usage Scenarios

If you are getting started and want to setup your own FfDL deployment, please follow the steps below.
If you have a FfDL deployment up and running, you can jump to FfDL User Guide to use FfDL for training your deep learning models.
If you want to leverage Jupyter notebooks to launch training on your FfDL cluster, please follow these instructions
If you have FfDL configured to use GPUs, and want to train using GPUs, follow steps here
To invoke Adversarial Robustness Toolbox to find vulnerabilities in your models, follow the instructions here
To deploy your trained models, follow the integration guide with Seldon
If you have used FfDL to train your models, and want to use a GPU enabled public cloud hosted service for further training and serving, please follow instructions here to train and serve your models using Watson Studio Deep Learning service.

Steps

Quick Start

1.1 Installation using Kubernetes Cluster
1.2 Installation using Kubeadm-DIND

Test
Monitoring
Development
Clean Up
Troubleshooting
References

1. Quick Start

There are multiple installation paths for installing FfDL into an existing Kubernetes cluster. Below are the steps for quick install. If you want to follow more detailed step by step instructions , please visit the detailed installation guide

You need to initialize tiller with helm init before running the following commands.

1.1 Installation using Kubernetes Cluster

To install FfDL to any proper Kubernetes cluster, make sure kubectl points to the right namespace, then deploy the platform services:

export NAMESPACE=default # If your namespace does not exist yet, please create the namespace `kubectl create namespace $NAMESPACE` before running the make commands below
export SHARED_VOLUME_STORAGE_CLASS="ibmc-file-gold" # Change the storage class to what's available on your Cloud Kubernetes Cluster.

helm install ibmcloud-object-storage-plugin --name ibmcloud-object-storage-plugin --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE # Configure s3 driver on the cluster
helm install ffdl-helper --name ffdl-helper --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS --wait # Deploy all the helper micro-services for ffdl
helm install ffdl-core --name ffdl-core --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,lcm.shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS --wait # Deploy all the core ffdl services.

1.2 Installation using Kubeadm-DIND

If you have Kubeadm-DIND installed on your machine, use these commands to deploy the FfDL platform:

export SHARED_VOLUME_STORAGE_CLASS=""
export NAMESPACE=default

./bin/s3_driver.sh # Copy the s3 drivers to each of the DIND node
helm install ibmcloud-object-storage-plugin --name ibmcloud-object-storage-plugin --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,cloud=false
helm install ffdl-helper --name ffdl-helper --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS,localstorage=true --wait
helm install ffdl-core --name ffdl-core --repo https://ibm.github.io/FfDL/helm-charts --set namespace=$NAMESPACE,lcm.shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS --wait

# Forward the necessary microservices from the DIND cluster to your localhost.
./bin/dind-port-forward.sh

2. Test

To submit a simple example training job that is included in this repo (see etc/examples folder):

Note: For PUBLIC_IP, put down one of your Cluster Public IP that can access your Cluster's NodePorts. You can check your Cluster Public IP with kubectl get nodes -o wide. For IBM Cloud, you can get your Public IP with bx cs workers <cluster_name>.

export PUBLIC_IP=<Cluster Public IP> # Put down localhost if you are running with Kubeadm-DIND
make test-push-data-s3
make test-job-submit

3. Monitoring

The platform ships with a simple Grafana monitoring dashboard. The URL is printed out when running the status make target.

4. Development

Please refer to the developer guide for more details.

5. Clean Up

If you want to remove FfDL from your cluster, simply use the following commands.

helm delete --purge ffdl-core ffdl-helper

If you want to remove the storage driver from your cluster, run:

helm delete --purge ibmcloud-object-storage-plugin

For Kubeadm-DIND, you need to kill your forwarded ports. Note that the below command will kill all the ports that are created with kubectl.

kill $(lsof -i | grep kubectl | awk '{printf $2 " " }')

6. Troubleshooting

FfDL has only been tested under Mac OS and Linux

If glide install fails with an error complaining about non-existing paths (e.g., "Without src, cannot continue"), make sure to follow the standard Go directory layout (see Prerequisites section).
To remove FfDL on your Cluster, simply run make undeploy
When using the FfDL CLI to train a model, make sure your directory path doesn't have slashes / at the end.
If your job is stuck in pending stage, you can try to redeploy the plugin with helm install storage-plugin --set dind=true,cloud=false for Kubeadm-DIND and helm install storage-plugin for general Kubernetes Cluster. Also, double check your training job manifest file to make sure you have the correct object storage credentials.

7. References

Based on IBM Research work in Deep Learning.

B. Bhattacharjee et al., "IBM Deep Learning Service," in IBM Journal of Research and Development, vol. 61, no. 4, pp. 10:1-10:11, July-Sept. 1 2017. https://arxiv.org/abs/1709.05871
Scott Boag, et al. Scalable Multi-Framework Multi-Tenant Lifecycle Management of Deep Learning Training Jobs, In Workshop on ML Systems at NIPS'17, 2017. http://learningsys.org/nips17/assets/papers/paper_29.pdf

ffdl's People

Contributors

Stargazers

Watchers

Forkers

whummer sboagibm vinodmut 4projects sanjw kant zhangjm12 intuitionmachine hailiang-wang micseb hongtao12310 benzei bpradipt ccortezb trinhdinhphuong fireae jdc08161063 opencvfun cmluciano yangkf1985 amitg90 snazz2001 mbarce rkhalaf visahak gregsimpson wangmiao1981 osswangxining lukw00heck adityavs xieydd ursulaquan deepkapha dmatch01 spencerx frosenberg hephaex shubhampachori12110095 cclauss castbld hendrikmax akchinstc nkpng2k animeshsingh cwickniss vishalbelsare ckadner amitkumarj441 wei-he meelement hsaputra charlesbrandontheis jiangxl2018 cloustone kevinyu98 just4jc basavaraju-g ginusxiao adamjm abel1225 taressahalpin afcarl wwalisa bkalapar azzajab gridl arifsajal sdmonov huyahuioo k82cn jackyhuang0 whmnoe4j nisarpro jayaramkr jmscraig arifulmondal stephanzf lkrishnamurthy lengrongfu philipz ffdl j3ffyang prpankajsingh momor666 awesomemachinelearning kkm24132 cnzebra tonylv xacrolyte empires2 mzkaramat adrian555 shafiahmed lambertzhao1991 algoskynet duoergun0729 fly-luck tomcli siju-samuel twistedtree

ffdl's Issues

Logs --follow process times out after 4 minutes

It used to be that the FfDL CLI command to follow the logs of an ongoing training job $FFDL_CMD logs --follow ${MODEL_ID} would tail the training logs until completion of the training job. The logs --follow process returned control only after the training job was complete. This was a useful feature when chaining up commands to create a semi-automated machine learning pipeline, where subsequent commands require the output data of the training job whose logs are being "followed". We have a small example of such a training pipeline in our ART notebook which is currently broken.

That behavior changed with the merge of PR #79. Now the the $FFDL_CMD logs --follow ${MODEL_ID} process terminates after 4 minutes -- usually before the training job is completed -- which causes the failure of subsequent processes that depend on training output data.

Code change causing the regression:

https://github.com/IBM/FfDL/pull/79/files?utf8=%E2%9C%93&diff=split&w=1#diff-7376976023aba3c29977b24e4794f938R1406

-	var ctx context.Context
-	var cancel context.CancelFunc
-	logr.Debugf("follow is %t", req.Follow)
-	if req.Follow {
-		ctx, cancel = context.WithTimeout(context.Background(), 10*(time.Hour*24))
-	} else {
-		ctx, cancel = context.WithTimeout(context.Background(), 5*time.Second)
-      }
+	ctx, cancel := context.WithTimeout(context.Background(), time.Minute*4)
 	defer cancel()

Command output showing the behavior:

Showing prematurely aborted logs --follow process with apparent 4 min timeout.

$FFDL_CMD train manifest.yml model.zip

Deploying model with manifest 'manifest.yml' and model file 'model.zip'...
Model ID: training-uLQ7ZMDmR
OK

$FFDL_CMD logs --follow training-uLQ7ZMDmR  &&  date

Getting model training logs for 'training-uLQ7ZMDmR'...
Training with training/test data at:
  DATA_DIR: /mnt/data/training-data-bbe28e19-4fba-4e29-af5f-564f0e0d3f53
  MODEL_DIR: /job/model-code
  TRAINING_JOB: 
  TRAINING_COMMAND: pip3 install keras; python3 convolutional_keras.py --data ${DATA_DIR}/mnist.npz
...
Wed Jun 27 19:19:34 UTC 2018: Running training job
...
Train on 54000 samples, validate on 6000 samples
Epoch 1/1
  128/54000 [..............................] - ETA: 5:21 - loss: 2.2977 - acc: 0.1562
  256/54000 [..............................] - ETA: 4:37 - loss: 2.2591 - acc: 0.1758
...
45184/54000 [========================>.....] - ETA: 39s - loss: 0.3311 - acc: 0.8972
45312/54000 [========================>.....] - ETA: 39s - loss: 0.3305 - acc: 0.8974
45440/54000 [========================>.....] - ETA: 38s - loss: 0.3299 - acc: 0.8976

Wed Jun 27 12:23:35 PDT 2018

Notice the time stamps:
Wed Jun 27 19:19:34 UTC 2018: Running training job -> date/time training starts
Wed Jun 27 12:23:35 PDT 2018 > date/time just after the logs job returns (after 4 min)

Syncing with the latest internal code base

The current learner implementation breaks in Kubernetes 1.9.4 and above

As mentioned in #45, Kubernetes 1.9.4 changes the secret, configMap, downwardAPI and projected volumes to mount with read-only. Since the current learner implementation needs write access to its mounted volume, the temporary solution is to set the feature gate ReadOnlyAPIDataVolumes=false. We should change the learner implementation so that it can work with read only access on the mounted volume.

Training status is PENDING not change

I can‘t get any error log ...

Travis should test pulling external images for build

Currently its following the complete build process, including building images locally

VCK integration proposal

Integration Proposal

Implement a new module that handles creating the volumemanage for VCK
Insert logic to provision volumemanage resource and monitor it for completion before executing the training job workload.
To make it more elastic, we need to come up with some algorithm on how much data replicas we need for each job. Then create some labels/tags to allow users to reuse the same dataset volume.
Need to figure out a shared file storage for all the learner pods (required for many distributed learning methods) and a way to store the model results for our users.

For more details, please refer to https://github.com/IBM/FfDL/blob/vck-patch/etc/examples/vck-integration.md

[Documentation] Update IBM Cloud CLI instructions in /etc/converter/train-deploy-wml.md

Some CLI-specific instructions in this document are based on older versions of the command line. They should be updated to reflect the latest release, which uses ibmcloud instead of bx as the binary.

Examples:

Prerequistes section:

Current:

bx plugin repo-add Bluemix https://plugins.ng.bluemix.net
bx plugin install machine-learning -r bluemix
bx target -o ORG -s SPACE

Simpler version: (no need to add the repo - it's the default; no target required because WML service is now resource managed)

ibmcloud plugin install  machine-learning

section 1.1

Current:

bx cf create-service pm-20 lite watson-machine-learning
bx cf create-service-key watson-machine-learning WML-Key

New:

ibmcloud resource service-instance-create watson-machine-learning pm-20 lite us-south
ibmcloud resource service-key-create WML-Key Writer --instance-name watson-machine-learning

section 1.2

ibmcloud resource service-key WML-Key

Here on it should only be necessary to replace bx with ibmcloud
...

There's a good chance that other documentation is impacted as well. This is the first doc I've tried to follow.

Add support to switch to GPU learner images seamlessly

Currently we had a hard-coded solution to use GPU learner image with accelerator at the gpu-dev branch. We will need to implement a trigger to pull the correct learner image without using the hard-coded solution.

FfDL CLI output is not properly machine parsable

Using the ffdl cli to get a model, returns YAML or JSON, but in both cases an header line "Getting model xyz" is included, which breaks parsing.

# ffdl show training-II-h6nxmR --json 
Getting model 'training-II-h6nxmR'...
{
	"Payload": {
		"model_id": "training-II-h6nxmR",
...

Both JSON|YAML and the message are sent to stdout so the only way to separate them is to grep...

# ffdl show training-II-h6nxmR --json | egrep -v "^Getting model" | jq .Payload.training.training_status
{
  "completed": "1539854246722",
  "status": "COMPLETED",
  "status_description": "COMPLETED",
  "submitted": "1539853988330"
}

Grafana charts shows no data points

Hi, I've installed FfDL in a completely offline kubernetes cluster:

Imported all the necessary docker images to each cluster node.
Inited tiller with specified image so it won't pull from the Internet.
Installed FfDL using helm.
Trained the example model according to your instructions.

Everything worked fine, and I've got the training results, but Grafana showed nothing but mostly a 'no data points' hint on its panel.
Four dashboards:

And I can't find any useful Prometheus or Grafana logs.

BTW, I've commented out the env variable 'GF_INSTALL_PLUGINS' in spec of container 'grafana' in templates/monitoring/prometheus-deployment.yml, for it would try to download from the Internet.

Any hint on what's missing? Thanks!

Build and push latest version of Docker images on master builds

We need to set up a proper pipeline, to build and push the latest version of the Docker images on green master builds.

Move constants values to be configurable in helm chart

For best practice, we shouldn't have some of the resource constants hard coded in the code base. This will cause issue such as #13 hard to fix. Also, making some of these constants configurable in helm chart can allow users deploy FfDL based on their Kubernetes Cluster size.

UI update proposal

Our intern Andrew has done a great job on updating and fixing some of our UI bugs. We should review and pull this in to master/ffdl branch when we have time.

https://github.com/drewbutlerbb4/FfDL

FfDL UI issue: missing font-awesome.min.css

The Font awesome library wasn't included in the FfDL UI. We should include it and rebuild our UI image.

Test Minikube v0.26.0 with k8s v1.10.0 on Travis CI

Investigate what changes need to make for migrating FfDL to latest minikube version and k8s v1.10.0.

dind-port-forward.sh -> invalid resource name ?

if i execute the script, I will get error look similar below:
root@ffdl2018:~/FfDL/bin# kubectl port-forward pod/$ui_pod $ui_port:8080
error: invalid resource name "pod/": [may not contain '/']

So I tried to remove the pod/ thinking maybe newer version of kubeadmin-dind look like the pod/ , but i get different error below. Can someone help me with the error message below?

Forwarding from 127.0.0.1:31300 -> 8080
Handling connection for 30029
E1031 14:22:28.129745 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:28 socat[11424] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
E1031 14:22:30.160553 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:30 socat[11441] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
E1031 14:22:32.191360 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:32 socat[11492] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
E1031 14:22:34.225286 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:34 socat[11493] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
creating data source...
Handling connection for 30029
set up dashboards
Handling connection for 30029
Finished

Moderate security vulnerability

Moderate severity

The moment module before 2.19.3 for Node.js is prone to a regular expression denial of service via a crafted date str...

package-lock.json update suggested:
moment ~> 2.19.3

Add recommended resource specs to README

We should add a note to the README that 4GB+ RAM is the recommended configuration for Minikube.

Setup more complicated than 3 steps in README

Documented commands do not show all how to setup FfDL from scratch

I noticed that setting up FfDL on macOS is more involved than the steps in the documentation - steps like make minikube, eval $(minikube docker-env) or make docker-build-base are omitted and it would also help to have instructions on how to install dependencies. In general, the following instructions should work:

# Install Docker
# Approximately https://docs.docker.com/docker-for-mac/install/

# Install Go
brew install go
brew install glide  # Alternative: curl https://glide.sh/get | sh
export GOPATH=$HOME/go
echo "export GOPATH=$HOME/go" >> ~/.profile
export PATH=${GOPATH}/bin:$PATH
echo "export PATH=\$GOPATH/bin:\$PATH" >> ~/.profile
source ~/.profile

# Install Minikube
brew cask install virtualbox  # or use installer from https://www.virtualbox.org/wiki/Downloads
brew cask install minikube
brew install kubernetes-helm

# Hyperkit
curl -LO https://storage.googleapis.com/minikube/releases/latest/docker-machine-driver-hyperkit \
&& chmod +x docker-machine-driver-hyperkit \
&& sudo mv docker-machine-driver-hyperkit /usr/local/bin/ \
&& sudo chown root:wheel /usr/local/bin/docker-machine-driver-hyperkit \
&& sudo chmod u+s /usr/local/bin/docker-machine-driver-hyperkit
# Potential Alternative:
# brew install --build-from-source hyperkit

# Clone FfDL
mkdir -p $GOPATH/src/github.com/IBM && cd $_
git clone https://github.com/IBM/FfDL.git && cd FfDL

# Build FfDL
export VM_TYPE=minikube
# Modify Makefile and change MINIKUBE_DRIVER from xhyve to hyperkit
sed -i '' -e "s/MINIKUBE_DRIVER ?= xhyve/MINIKUBE_DRIVER ?= hyperkit/g" Makefile
glide install
make build
make minikube
eval $(minikube docker-env)
make docker-build-base
make docker-build
make deploy

With two minor things to add...

Probably need to install helm and kubectl as well
Need to add instructions to install Docker

...and three questions:

Would you like me to do this and submit a PR?
Which document should this go into? docs/setup-guide.md?
Do we want to add a fully automatic script like we provide for DIND with the new PR? What does the end game look like regarding setup? Will we try to build one master installation script set for all platforms, one set for each platform or do we ultimately want to push the entire setup into tools like Ansible or helm?

Thanks in advance.

PS regarding troubleshooting:
We should also consider adding a docs/troubleshooting.md.
For instance, I have seen the following issues on Minikube:

If make deploy dies after "Initializing..." most likely VM_TYPE=minikube was not set.
If make deploy gets stuck at "Installing helm/tiller..." most likely helm is not installed.
Does that make sense? Do you want me to seed a troubleshooting file as well? Can you think of additional common errors?

support for tensorflow distribution

Would you please tell us how to support tensorflow distributions train in FfDL.
As we known, there are worker tasks and parameter tasks in tensorflow, when using FfDL, should we specify the information clearly to FfDL.

Thanks

Learner Issue - ffdl/ffdl-controller not found

The Learner requires the ffdl-controller image and apparently is not available on the Public DockerHub.

Also, log-collector images (e.g. ffdl/tensorboard_extract_1.3-py3:latest) also not available on DockerHub.

build error

I am getting this error while running make docker-build command on a mac environment.

cd build/grpc-health-checker && make install-deps build-x86-64
glide -q install
[WARN]	The name listed in the config file (github.ibm.com/deep-learning-platform/grpc-health-checker) does not match the current location (github.com/IBM/FfDL/etc/dlaas-service-base/build/grpc-health-checker)
rm -rf bin/
CGO_ENABLED=0 GOOS=linux go build -ldflags "-s" -a -installsuffix cgo -o bin/grpc-health-checker
docker build -q -f Dockerfile.ubuntu -t dlaas-service-base:ubuntu16.04 .
sha256:6975b033728017afc0f9a2dd9978e76331411d323fef07c9b8600959cccdbf4a
docker tag dlaas-service-base:ubuntu16.04 docker.io/ffdl/dlaas-service-base:ubuntu16.04
docker build -q -f Dockerfile.alpine -t dlaas-service-base:alpine3.3 .
sha256:9e367f913da06d9758a26936bff18c23587e5eebfd424f9793a0cd488742e75b
docker tag dlaas-service-base:alpine3.3 docker.io/ffdl/dlaas-service-base:alpine3.3
(full_img_name=ffdl-metrics; \
		cd ./metrics/ && (if [ "minikube" = "minikube" ]; then eval $(minikube docker-env); fi; \
			docker build -q -t docker.io/ffdl/$full_img_name .))
Sending build context to Docker daemon  11.26MB
Step 1/6 : FROM dlaas-service-base:ubuntu16.04
pull access denied for dlaas-service-base, repository does not exist or may require 'docker login'
make[1]: *** [.docker-build] Error 1
make: *** [docker-build-metrics] Error 2```

Multiple Learners Training Job Fails

Raising this issue as per my conversation with Tommy earlier today.

minikube version: v0.25.2
kubectl version:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"", Minor:"", GitVersion:"v1.9.4", GitCommit:"bee2d1505c4fe820744d26d41ecd3fdd4a3d6546", GitTreeState:"clean", BuildDate:"2018-03-21T21:48:36Z", GoVersion:"go1.9.1", Compiler:"gc", Platform:"linux/amd64"}

Kubernetes: 1.9.4

When running a single learner job with the same python script, no issues. Whole process completes.

When running a multi learner job (only thing changed in the manifest is learners: 2 process fails.

Logs are as follows:

Nicholass-MBP:FfDL npng$ $CLI_CMD list
Getting all models ...
ID                   Name          Framework     Training status   Submitted   Completed
training-C6DTcIMmR   h2o3_automl   h2o3:latest   COMPLETED         N/A         N/A
training-TOnw5SGiR   h2o3_automl   h2o3:latest   FAILED            N/A         N/A

2 records found.
Nicholass-MBP:FfDL npng$ $CLI_CMD logs training-TOnw5SGiR
Getting model training logs for 'training-TOnw5SGiR'...
Status: FAILED
Cannot read trained model log: rpc error: code = Unknown desc = NoSuchKey: The specified key does not exist.
	status code: 404, request id: , host id: Nicholass-MBP:FfDL npng$

Nicholass-MBP:FfDL npng$ kubectl get pods
NAME                                                              READY     STATUS             RESTARTS   AGE
alertmanager-78676b6756-2l2zb                                     1/1       Running            0          32m
etcd0                                                             1/1       Running            0          32m
ffdl-lcm-dd5f59b55-bm52q                                          1/1       Running            0          32m
ffdl-restapi-7789dbdf5f-2j4mh                                     1/1       Running            0          32m
ffdl-trainer-59bd46cfdb-9csqr                                     1/1       Running            2          32m
ffdl-trainingdata-688bf5f44b-48wqb                                1/1       Running            5          32m
ffdl-ui-6545f7dd5b-lpqcd                                          1/1       Running            0          32m
grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9     0/1       ImagePullBackOff   0          6m
jobmonitor-3be3332c-e2f4-4a2b-4775-3069398a12ba-64f9b94465s7gmh   1/1       Running            0          6m
learner-1-3be3332c-e2f4-4a2b-4775-3069398a12ba-f8d8b8c98-6drgz    0/7       Pending            0          6m
learner-2-3be3332c-e2f4-4a2b-4775-3069398a12ba-979949d49-9jv9f    0/7       Pending            0          6m
mongo-0                                                           1/1       Running            0          32m
prometheus-556d97b566-fmgkp                                       2/2       Running            0          32m
pushgateway-665b6c4b9-hg85s                                       2/2       Running            0          32m
storage-0                                                         1/1       Running            0          32m
Nicholass-MBP:FfDL npng$

Nicholass-MBP:FfDL npng$ kubectl describe pod grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9
Name:           grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9
Namespace:      default
Node:           minikube/192.168.99.100
Start Time:     Fri, 27 Apr 2018 15:45:33 -0700
Labels:         app=grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba
                pod-template-hash=3168270778
                service=dlaas-parameter-server
                training_id=training-TOnw5SGiR
Annotations:    <none>
Status:         Pending
IP:             172.17.0.16
Controlled By:  ReplicaSet/grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd
Containers:
  grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba:
    Container ID:
    Image:          docker.io/ffdl/parameter-server:master-97
    Image ID:
    Port:           50051/TCP
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  1048576k
    Requests:
      cpu:     500m
      memory:  1048576k
    Environment:
      JOBID:                  1111
      NUM_LEARNERS:           2
      TCP_PORT:               50051
      ZK_DIR:                 training-TOnw5SGiR/parameter-server
      ZK_DIR:                 training-TOnw5SGiR/parameter-server
      DLAAS_ETCD_ADDRESS:     <set to the key 'DLAAS_ETCD_ADDRESS' in secret 'lcm-secrets'>   Optional: false
      DLAAS_ETCD_USERNAME:    <set to the key 'DLAAS_ETCD_USERNAME' in secret 'lcm-secrets'>  Optional: false
      DLAAS_ETCD_PASSWORD:    <set to the key 'DLAAS_ETCD_PASSWORD' in secret 'lcm-secrets'>  Optional: false
      DLAAS_ETCD_PREFIX:      <set to the key 'DLAAS_ETCD_PREFIX' in secret 'lcm-secrets'>    Optional: false
      FOR_TEST:               1
      DLAAS_JOB_ID:           training-TOnw5SGiR
      ZNODE_NAME:             singleshard
      DATA_STORE_AUTHURL:     http://s3.default.svc.cluster.local
      MODEL_STORE_OBJECTID:   dlaas-models/training-TOnw5SGiR.zip
      RESULT_STORE_AUTHURL:   http://s3.default.svc.cluster.local
      RESULT_STORE_TYPE:      s3_datastore
      RESULT_STORE_USERNAME:  test
      MODEL_STORE_APIKEY:     test
      DATA_DIR:               h2o3_training_data
      DATA_STORE_TYPE:        s3_datastore
      MODEL_STORE_USERNAME:   test
      MODEL_DIR:              /model-code
      GPU_COUNT:              0.000000
      RESULT_DIR:             h2o3_trained_model
      DATA_STORE_OBJECTID:    h2o3_training_data
      SCHED_POLICY:           dense
      RESULT_STORE_OBJECTID:  h2o3_trained_model/training-TOnw5SGiR
      LOG_DIR:                /logs
      MODEL_STORE_AUTHURL:    http://s3.default.svc.cluster.local
      MODEL_STORE_TYPE:       s3_datastore
      DATA_STORE_USERNAME:    test
      DATA_STORE_APIKEY:      test
      RESULT_STORE_APIKEY:    test
      TRAINING_COMMAND:       python h2o3_baseline.py --trainDataFile ${DATA_DIR}/higgs_train_10k.csv --target response --memory 1
      TRAINING_ID:            training-TOnw5SGiR
    Mounts:
      /etc/certs/ from etcd-ssl-cert-vol (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nllw4 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  etcd-ssl-cert-vol:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  lcm-secrets
    Optional:    false
  default-token-nllw4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nllw4
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason                 Age               From               Message
  ----     ------                 ----              ----               -------
  Normal   Scheduled              6m                default-scheduler  Successfully assigned grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9 to minikube
  Normal   SuccessfulMountVolume  6m                kubelet, minikube  MountVolume.SetUp succeeded for volume "etcd-ssl-cert-vol"
  Normal   SuccessfulMountVolume  6m                kubelet, minikube  MountVolume.SetUp succeeded for volume "default-token-nllw4"
  Normal   Pulling                5m (x4 over 6m)   kubelet, minikube  pulling image "docker.io/ffdl/parameter-server:master-97"
  Warning  Failed                 5m (x4 over 6m)   kubelet, minikube  Failed to pull image "docker.io/ffdl/parameter-server:master-97": rpc error: code = Unknown desc = Error response from daemon: pull access denied for ffdl/parameter-server, repository does not exist or may require 'docker login'
  Warning  Failed                 5m (x4 over 6m)   kubelet, minikube  Error: ErrImagePull
  Warning  Failed                 5m (x6 over 6m)   kubelet, minikube  Error: ImagePullBackOff
  Normal   BackOff                1m (x21 over 6m)  kubelet, minikube  Back-off pulling image "docker.io/ffdl/parameter-server:master-97"

I think this is what is blocking the rest of the processes:

  Normal   Pulling                5m (x4 over 6m)   kubelet, minikube  pulling image "docker.io/ffdl/parameter-server:master-97"
  Warning  Failed                 5m (x4 over 6m)   kubelet, minikube  Failed to pull image "docker.io/ffdl/parameter-server:master-97": rpc error: code = Unknown desc = Error response from daemon: pull access denied for ffdl/parameter-server, repository does not exist or may require 'docker login'
  Warning  Failed                 5m (x4 over 6m)   kubelet, minikube  Error: ErrImagePull
  Warning  Failed                 5m (x6 over 6m)   kubelet, minikube  Error: ImagePullBackOff
  Normal   BackOff                1m (x21 over 6m)  kubelet, minikube  Back-off pulling image "docker.io/ffdl/parameter-server:master-97"

about kubernetes cluster manager and release scheduler

@FfDL
I have read the paper about FfDL. http://learningsys.org/nips17/assets/papers/paper_29.pdf. In the FfDL architecture, Kubernetes Cluster Manager exist as a kubernetes controller. Are there some detail resources about the service, or Would you please tell us the future release plan.
We think it will be better that the development schedule is available.

FfDL logging issues with Elastic Search

The FfDL Elastic Search sometimes has an overhead issue when creating the emetrics/logline mapping.

[2018-02-15T18:11:21,334][INFO ][o.e.c.m.MetaDataMappingService] [VPF0eed] [dlaas_learner_data/Of58W91xS-6OlsuQEGByzw] update_mapping [logline]
[2018-02-15T18:11:36,852][INFO ][o.e.m.j.JvmGcMonitorService] [VPF0eed] [gc][556] overhead, spent [258ms] collecting in the last [1s]

When the Elastic Search works properly, it should have the following logs for mapping update/create.

[2018-02-15T17:52:52,598][INFO ][o.e.c.m.MetaDataMappingService] [R7H6R6o] [dlaas_learner_data/d-NMzvwRT_CXgMwHOBinTg] update_mapping [logline]
[2018-02-15T17:54:51,289][INFO ][o.e.c.m.MetaDataMappingService] [R7H6R6o] [dlaas_learner_data/d-NMzvwRT_CXgMwHOBinTg] create_mapping [emetrics]

Potential issue for ffdl-databroker_s3/caffe-model job

When running the caffe-model job with FfDL. The databroker_s3 always having issue to pull one of the file from s3://mnist_lmdb_data/train/data.mdb

Using Object Storage account test at http://s3.default.svc.cluster.local
Download start: Mon Feb 12 19:38:10 UTC 2018
Downloading from bucket mnist_lmdb_data to /job/mnist_lmdb_data
Completed 256.0 KiB/68.8 MiB (213.3 KiB/s) with 4 file(s) remaining
Completed 264.0 KiB/68.8 MiB (217.1 KiB/s) with 4 file(s) remaining
download: s3://mnist_lmdb_data/test/lock.mdb to job/mnist_lmdb_data/test/lock.mdb
Completed 264.0 KiB/68.8 MiB (217.1 KiB/s) with 3 file(s) remaining
Completed 520.0 KiB/68.8 MiB (259.9 KiB/s) with 3 file(s) remaining
Completed 776.0 KiB/68.8 MiB (369.5 KiB/s) with 3 file(s) remaining
Completed 1.0 MiB/68.8 MiB (469.2 KiB/s) with 3 file(s) remaining  
Completed 1.3 MiB/68.8 MiB (585.1 KiB/s) with 3 file(s) remaining  
Completed 1.5 MiB/68.8 MiB (701.2 KiB/s) with 3 file(s) remaining  
Completed 1.8 MiB/68.8 MiB (817.2 KiB/s) with 3 file(s) remaining  
Completed 2.0 MiB/68.8 MiB (933.2 KiB/s) with 3 file(s) remaining  
Completed 2.3 MiB/68.8 MiB (825.7 KiB/s) with 3 file(s) remaining  
Completed 2.5 MiB/68.8 MiB (916.3 KiB/s) with 3 file(s) remaining  
Completed 2.8 MiB/68.8 MiB (973.5 KiB/s) with 3 file(s) remaining  
Completed 3.0 MiB/68.8 MiB (1.0 MiB/s) with 3 file(s) remaining    
Completed 3.3 MiB/68.8 MiB (1.1 MiB/s) with 3 file(s) remaining    
Completed 3.5 MiB/68.8 MiB (1.2 MiB/s) with 3 file(s) remaining    
Completed 3.8 MiB/68.8 MiB (1.3 MiB/s) with 3 file(s) remaining    
Completed 4.0 MiB/68.8 MiB (1.3 MiB/s) with 3 file(s) remaining    
Completed 4.3 MiB/68.8 MiB (1.4 MiB/s) with 3 file(s) remaining    
Completed 4.5 MiB/68.8 MiB (1.5 MiB/s) with 3 file(s) remaining    
Completed 4.8 MiB/68.8 MiB (1.4 MiB/s) with 3 file(s) remaining    
Completed 5.0 MiB/68.8 MiB (1.5 MiB/s) with 3 file(s) remaining    
Completed 5.3 MiB/68.8 MiB (1.6 MiB/s) with 3 file(s) remaining    
Completed 5.5 MiB/68.8 MiB (1.7 MiB/s) with 3 file(s) remaining    
Completed 5.6 MiB/68.8 MiB (1.7 MiB/s) with 3 file(s) remaining    
Completed 5.9 MiB/68.8 MiB (1.6 MiB/s) with 3 file(s) remaining    
Completed 6.1 MiB/68.8 MiB (1.7 MiB/s) with 3 file(s) remaining    
Completed 6.4 MiB/68.8 MiB (1.7 MiB/s) with 3 file(s) remaining    
Completed 6.6 MiB/68.8 MiB (1.8 MiB/s) with 3 file(s) remaining    
Completed 6.9 MiB/68.8 MiB (1.9 MiB/s) with 3 file(s) remaining    
Completed 7.1 MiB/68.8 MiB (1.8 MiB/s) with 3 file(s) remaining    
Completed 7.4 MiB/68.8 MiB (1.9 MiB/s) with 3 file(s) remaining    
Completed 7.6 MiB/68.8 MiB (2.0 MiB/s) with 3 file(s) remaining    
Completed 7.9 MiB/68.8 MiB (2.0 MiB/s) with 3 file(s) remaining    
Completed 8.1 MiB/68.8 MiB (2.1 MiB/s) with 3 file(s) remaining    
Completed 8.4 MiB/68.8 MiB (2.1 MiB/s) with 3 file(s) remaining    
Completed 8.6 MiB/68.8 MiB (2.2 MiB/s) with 3 file(s) remaining    
Completed 8.9 MiB/68.8 MiB (1.9 MiB/s) with 3 file(s) remaining    
Completed 8.9 MiB/68.8 MiB (1.8 MiB/s) with 3 file(s) remaining    
download: s3://mnist_lmdb_data/train/lock.mdb to job/mnist_lmdb_data/train/lock.mdb
Killed
Killed
download failed: s3://mnist_lmdb_data/train/data.mdb to job/mnist_lmdb_data/train/data.mdb [Errno 12] Cannot allocate memory

I also tried to increase the job memory and use IBM Cloud Object storage and still have the same issue. So I believe the issue could be

https://github.com/albarji/caffe-demos/blob/master/mnist/mnist_train_lmdb/data.mdb file is corrupted and we should use a different dataset.
or
ffdl-databroker_s3 may have a bug when pulling certain file.

external NFS storage support

Hello, @FfDL
We deploy FfDL in a private environment in which S3 and Swift are not available, only support NFS external storage. for model definition file, we can use localstack in current dev environment, for training data, we wish use NFS.
The following steps are our adaptions for NFS.

Deploy an external NFS server out of kubernetes.
Add PVs declaration in templates folder
Add PVCs file "/etc/static-volumes/PVCs.yaml" in LCM docker environment

We are confirming the above method, however, new question already occurred.
If there are two models to be submitted, they are all using NFS static external storage at the same mount point, is this not a problem?

Would you please confirm the above method and question, or provide a right solution to us.

Thanks

Uber Horovod Testing

Currently we need an image to test on CPU

Unable to mount volumes for pod Learner

What happend:
Hi there, thanks a lot for your work. It's impressive, so I was trying to deploy it on local MINIKUBE and local DIND, but in fact none of them worked properly. I was stuck in an issue for few days, so I'd like to ask you guys for help. By chance I've found something similar to my issue from your docs but in the different condition, which means:

my local minikube encountered the issue which was recorded in the DIND-TRAING -- all pods worked as expected

alertmanager-7bd87d99cc-jhp2b                                     1/1       Running             0          6h
etcd0                                                             1/1       Running             0          6h
ffdl-lcm-8d555c7bf-dqqhg                                          1/1       Running             0          6h
ffdl-restapi-7f5c57c77d-k67pm                                     1/1       Running             0          6h
ffdl-trainer-6777dd5756-xkk65                                     1/1       Running             0          6h
ffdl-trainingdata-696b99ff5c-tvbtc                                1/1       Running             0          6h
ffdl-ui-95d6464c7-bv2sn                                           1/1       Running             0          6h
jobmonitor-0d296791-2adc-4336-4f01-b280090460c3-cbdb48cfd-qqsvz   1/1       Running             0          1h
learner-0d296791-2adc-4336-4f01-b280090460c3-0                    0/1       ContainerCreating   0          1h
lhelper-0d296791-2adc-4336-4f01-b280090460c3-54858658b-p7vfc      2/2       Running             0          1h
mongo-0                                                           1/1       Running             4          6h
prometheus-67fb854b59-c884p                                       2/2       Running             0          6h
pushgateway-5665768d5c-jdlnl                                      2/2       Running             0          6h
storage-0                                                         1/1       Running             0          6h

except the pod learner with eternal pending status because of the following warning.

Unable to mount volumes for pod "learner-d3a04eac-a64a-427e-56e5-8366cc84292f-0_default(33f78708-f963-11e8-aa08-0800275e57f0)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-d3a04eac-a64a-427e-56e5-8366cc84292f-0". list of unmounted volumes=[cosinputmount-d3a04eac-a64a-427e-56e5-8366cc84292f cosoutputmount-d3a04eac-a64a-427e-56e5-8366cc84292f]. list of unattached volumes=[cosinputmount-d3a04eac-a64a-427e-56e5-8366cc84292f cosoutputmount-d3a04eac-a64a-427e-56e5-8366cc84292f learner-entrypoint-files jobdata]

and here's the details of pod learner-x

Name:               learner-0d296791-2adc-4336-4f01-b280090460c3-0
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               minikube/10.0.2.15
Start Time:         Thu, 06 Dec 2018 17:05:52 +0100
Labels:             controller-revision-hash=learner-0d296791-2adc-4336-4f01-b280090460c3-999bf4986
                    service=dlaas-learner
                    statefulset.kubernetes.io/pod-name=learner-0d296791-2adc-4336-4f01-b280090460c3-0
                    training_id=training-bFEXXGPmR
                    user_id=test-user
Annotations:        scheduler.alpha.kubernetes.io/nvidiaGPU={ "AllocationPriority": "Dense" }
                    scheduler.alpha.kubernetes.io/tolerations=[ { "key": "dedicated", "operator": "Equal", "value": "gpu-task" } ]
Status:             Pending
IP:
Controlled By:      StatefulSet/learner-0d296791-2adc-4336-4f01-b280090460c3
Containers:
  learner:
    Container ID:
    Image:         tensorflow/tensorflow:1.5.0-py3
    Image ID:
    Ports:         22/TCP, 2222/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      bash
      -c
      export PATH=/usr/local/bin/:$PATH; cp /entrypoint-files/*.sh /usr/local/bin/; chmod +x /usr/local/bin/*.sh;
                        if [ ! -f /job/load-model.exit ]; then
                          while [ ! -f /job/load-model.start ]; do sleep 2; done ;
                          date "+%s%N" | cut -b1-13 > /job/load-model.start_time ;

                        echo "Starting Training $TRAINING_ID"
                        mkdir -p "$MODEL_DIR" ;
                        python -m zipfile -e $RESULT_DIR/_submitted_code/model.zip $MODEL_DIR  ;
                          echo $? > /job/load-model.exit ;
                        fi
                        echo "Done load-model" ;
                        if [ ! -f /job/learner.exit ]; then
                          while [ ! -f /job/learner.start ]; do sleep 2; done ;
                          date "+%s%N" | cut -b1-13 > /job/learner.start_time ;

                        for i in ${!ALERTMANAGER*} ${!DLAAS*} ${!ETCD*} ${!GRAFANA*} ${!HOSTNAME*} ${!KUBERNETES*} ${!MONGO*} ${!PUSHGATEWAY*}; do unset $i; done;
                        export LEARNER_ID=$((${DOWNWARD_API_POD_NAME##*-} + 1)) ;
                        mkdir -p $RESULT_DIR/learner-$LEARNER_ID ;
                        mkdir -p $CHECKPOINT_DIR ;bash -c 'train.sh >> $JOB_STATE_DIR/latest-log 2>&1 ; exit ${PIPESTATUS[0]}' ;
                          echo $? > /job/learner.exit ;
                        fi
                        echo "Done learner" ;
                        if [ ! -f /job/store-logs.exit ]; then
                          while [ ! -f /job/store-logs.start ]; do sleep 2; done ;
                          date "+%s%N" | cut -b1-13 > /job/store-logs.start_time ;

                        echo Calling copy logs.
                        mv -nf $LOG_DIR/* $RESULT_DIR/learner-$LEARNER_ID ;
                        ERROR_CODE=$? ;
                        echo $ERROR_CODE > $RESULT_DIR/learner-$LEARNER_ID/.log-copy-complete ;
                        bash -c 'exit $ERROR_CODE' ;
                          echo $? > /job/store-logs.exit ;
                        fi
                        echo "Done store-logs" ;
                      while true; do sleep 2; done ;
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:             500m
      memory:          1048576k
      nvidia.com/gpu:  0
    Requests:
      cpu:             500m
      memory:          1048576k
      nvidia.com/gpu:  0
    Environment:
      LOG_DIR:                     /job/logs
      GPU_COUNT:                   0.000000
      TRAINING_COMMAND:            python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz   --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz   --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001   --trainingIters 2000
      TRAINING_ID:                 training-bFEXXGPmR
      DATA_DIR:                    /mnt/data/tf_training_data
      MODEL_DIR:                   /job/model-code
      RESULT_DIR:                  /mnt/results/tf_trained_model/training-bFEXXGPmR
      DOWNWARD_API_POD_NAME:       learner-0d296791-2adc-4336-4f01-b280090460c3-0 (v1:metadata.name)
      DOWNWARD_API_POD_NAMESPACE:  default (v1:metadata.namespace)
      LEARNER_NAME_PREFIX:         learner-0d296791-2adc-4336-4f01-b280090460c3
      TRAINING_ID:                 training-bFEXXGPmR
      NUM_LEARNERS:                1
      JOB_STATE_DIR:               /job
      CHECKPOINT_DIR:              /mnt/results/tf_trained_model/_wml_checkpoints
      RESULT_BUCKET_DIR:           /mnt/results/tf_trained_model
    Mounts:
      /entrypoint-files from learner-entrypoint-files (rw)
      /job from jobdata (rw)
      /mnt/data/tf_training_data from cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 (rw)
      /mnt/results/tf_trained_model from cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3 (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  cosinputmount-0d296791-2adc-4336-4f01-b280090460c3:
    Type:       FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
    Driver:     ibm/ibmc-s3fs
    FSType:
    SecretRef:  &{cossecretdata-0d296791-2adc-4336-4f01-b280090460c3}
    ReadOnly:   false
    Options:    map[debug-level:warn endpoint:http://192.168.99.105:31172 tls-cipher-suite:DEFAULT cache-size-gb:0 chunk-size-mb:52 curl-debug:false kernel-cache:true multireq-max:20 bucket:tf_training_data ensure-disk-free:0 parallel-count:5 region:us-standard s3fs-fuse-retry-count:30 stat-cache-size:100000]
  cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3:
    Type:       FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
    Driver:     ibm/ibmc-s3fs
    FSType:
    SecretRef:  &{cossecretresults-0d296791-2adc-4336-4f01-b280090460c3}
    ReadOnly:   false
    Options:    map[cache-size-gb:0 curl-debug:false endpoint:http://192.168.99.105:31172 parallel-count:2 bucket:tf_trained_model debug-level:warn s3fs-fuse-retry-count:30 stat-cache-size:100000 chunk-size-mb:52 kernel-cache:false ensure-disk-free:2048 region:us-standard tls-cipher-suite:DEFAULT multireq-max:20]
  learner-entrypoint-files:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      learner-entrypoint-files
    Optional:  false
  jobdata:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     dedicated=gpu-task:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason       Age               From               Message
  ----     ------       ----              ----               -------
  Warning  FailedMount  1m (x40 over 1h)  kubelet, minikube  Unable to mount volumes for pod "learner-0d296791-2adc-4336-4f01-b280090460c3-0_default(ce612f9d-f970-11e8-aa08-0800275e57f0)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-0d296791-2adc-4336-4f01-b280090460c3-0". list of unmounted volumes=[cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3]. list of unattached volumes=[cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3 learner-entrypoint-files jobdata]

my local dind encountered the issue with non-hint FAILED ERROR while training. All the pods was running, but there're no pods jobmonitor, learner and lhelper.

Deploying model with manifest 'manifest_testrun.yml' and model files in '.'...
Handling connection for 31404
Handling connection for 31404
FAILED
Error 200: OK

What you expected to happen:
Make FfDL work as properly on either local DIND or MINIKUBE.

Environment:
OS: Darwin local 17.4.0 Darwin Kernel Version 17.4.0:
MINIKUBE:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

How to reproduce it (as minimally and precisely as possible):

I was just following README.rd with several make instructions

make deploy-plugin
make quickstart-deploy
make test-push-data-s3
make test-job-submit

Anything else we need to know?:

In situation 2, I totally followed the above-mentioned steps;
In situation 1, because it popped out hints that nfs error at first, and I just remember one of the doc I've read about MINIKUBE as if to say that, for persistent volumes, it just supports hostpath type, so I created a PV and PVC, here's the details.

$ kubectl describe pv hostpathtest
Name:            hostpathtest
Labels:          <none>
Annotations:     pv.kubernetes.io/bound-by-controller=yes
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:
Status:          Bound
Claim:           default/static-volume-1
Reclaim Policy:  Retain
Access Modes:    RWO
Capacity:        20Gi
Node Affinity:   <none>
Message:
Source:
    Type:          HostPath (bare host directory volume)
    Path:          /data/hostpath_test
    HostPathType:
Events:            <none>

$ kubectl describe pvc learner-1
Name:          learner-1
Namespace:     default
StorageClass:
Status:        Bound
Volume:        hostpathtest-learner
Labels:        type=dlaas-static-volume
Annotations:   kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{"volume.beta.kubernetes.io/storage-class":""},"labels":{"type":"dlaas-stat...
               pv.kubernetes.io/bind-completed=yes
               pv.kubernetes.io/bound-by-controller=yes
               volume.beta.kubernetes.io/storage-class=
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      20Gi
Access Modes:  RWO
Events:        <none>

Thanks in advance for all advices and have a good day

Add docker requirements in the docs for seldon model in FfDL

When image is built for seldon model using s2i, minimum required memory for docker is 8G.

Add these details in https://github.com/IBM/FfDL/blob/master/community/FfDL-Seldon/pytorch-model/README.md

Seldon Intergration

Using Seldon as an optional model deployment platform for FfDL.

Deploy FfDL in a dedicated namespace

Currently FfDL only can be deployed in the default namespace. We need to add some configuration for the helm chart and LCM provision to allow FfDL to be deployed on any namespace.

FfDL v0.1.1 model training error

model trained as command：
$CLI_CMD train etc/examples/tf-model/manifest-hostmount.yml etc/examples/tf-model

hostmount learner pod error as flow：
Starting Training training-PYCOsfJmg
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/usr/lib/python2.7/zipfile.py", line 1541, in <module>
main()
File "/usr/lib/python2.7/zipfile.py", line 1512, in main
with ZipFile(args[1], 'r') as zf:
File "/usr/lib/python2.7/zipfile.py", line 756, in init
self.fp = open(file, modeDict[mode])
IOError: [Errno 2] No such file or directory: '/mnt/results/results/training-PYCOsfJmg/_submitted_code/model.zip'
Done load-model

Readiness probe failed

the pod prometheus' status is CrashLoopBackOfff

Readiness probe failed: Get http://192.168.7.81:3000/api/health: dial tcp 192.168.7.81:3000: getsockopt: connection refused

Power support in FfDL

Work in progress

PyTorch code from Object storage download not being defined correctly

Compare to Kubeflow which contains Seldon-Core

Hi! thanks for open sourcing this big effort!

Would it be possible to compare this solution to Kubeflow which contains Seldon-Core and an example.

And finally, if you have some time, compare to PipelineAI ?

Prepare TensorFlow and Caffe sample GPU jobs.

Note: Currently GPU workloads on FfDL only works with Feature gate accelerator.

We need to prepare for some sample TensorFlow and Caffe jobs that uses GPU.

To run TensorFlow and Caffe GPU workloads, we need to build the LCM image with bvlc/caffe:gpu and tensorflow/tensorflow:latest-gpu enable under lcm/service/lcm/container_helper_extensions.go

Change the default policy to pull a versioned image from docker hub

use a tagged image, like ":master-36" - that tagging should be part of ci/cd

helm install issue with one pod getting into CrashLoopBackOff

I am getting error with one pod that gets into CrashLoopBackoff status. I have tried a few times with clean 'helm install' but get the same error.

ffdl-trainingdata-86c5578b75-v884m 0/1 CrashLoopBackOff

5 7m

Any suggestion, how to get past this error and have a clean deployment of this fabric? Thanks

Direct data download from hosting sites

FfDL should be able to handle direct data downloads from standard sites and upload to S3 compatible object storage

Please include Julia language for data science and high performance computing

How do we retrain based on the data which is produced as part of an initial training?

Need to clarify in documentation

Parameterize DIND scripts

While originally only intended as a quick help to setup FfDL's dependencies, the DIND scripts have become more widely used than anticipated. Thus, we should probably overhaul and parameterize them. Here are a few notes for improvement:

https://github.com/IBM/FfDL/blob/master/bin/dind_scripts/create_user.sh should have an environment variable for the username, so it can be adapted in one place
https://github.com/IBM/FfDL/blob/master/bin/dind_scripts/experimental_master.sh line 2 and 24 should use $USER, so this is not tied to the user being ffdlr
We should make sure $GOPATH is used consistently, so users who do not use ~/go can adapt it
The scripts should probably check more rigorously whether they were successful and potentially create reports, so tracing issues during setup becomes easier.
The current deploy-plugin target has side effects, i.e. it not only pulls and deploys the S3 driver plugin, but it also sets up a hostmount PV, which is not the standard for development and should probably be done in an independent task. [Also: Isn't that an NFS replacement, whereas the rest is about S3?]
The test-submit Makefile target will not work when used against existing S3 buckets, since it uses placeholder names for the buckets and endpoint. We either need to replace those like we replace the username and password or we need to halt the experimental_master.sh script and tell the user to adapt the etc/examples/tf-model/manifest.yml file before running make test-submit
Overall, we might want to turn the scripts into a full installer that queries for username, target directory etc., we should also add macOS support and potentially allow for more customization (e.g. use existing container registry rather than local registry)

Add new RBAC for LCM to support Kubernetes 1.9 and above

From Kubernetes 1.9 and above, the default role doesn't have permission to access and consume cluster resources, so we need to create a new RBAC for LCM to view and assign cluster resources to the learners. The new RBAC should be done as part of the helm install.

`132	1534166953114	pciBusID: 0000:00:06.0
133	1534166953115	totalMemory: 22.40GiB freeMemory: 22.29GiB
134	1534166953116	2018-08-13 13:29:02.319076: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:00:06.0, compute capability: 5.2)
135	1534166953117	/usr/local/bin/train.sh: line 18: 36 Killed python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001 --trainingIters 2000 2>&1
136	1534166953118	Training process finished. Exit code: 137
137	1534166953119	Job exited with error code 137
138	1534166953120	Failed: learner_exit_code: 137`

Adding support for Caffe2 and PyTorch

@Tomcli based on your investigation what`s needed on this side?

How to get model training process? such as 20%, 80%.

For some case, It will take long time to training a model, a "Processing" status is not enough for user experience. We will provide training progress and left time estimation.

Thanks

ibm / ffdl Goto Github PK

ffdl's Introduction

Fabric for Deep Learning (FfDL)

Prerequisites

Usage Scenarios

Steps

1. Quick Start

1.1 Installation using Kubernetes Cluster

1.2 Installation using Kubeadm-DIND

2. Test

3. Monitoring

4. Development

5. Clean Up

6. Troubleshooting

7. References

ffdl's People

Contributors

Stargazers

Watchers

Forkers

ffdl's Issues

Code change causing the regression:

Command output showing the behavior:

Integration Proposal

Documented commands do not show all how to setup FfDL from scratch

Recommend Projects

Recommend Topics

Recommend Org