Distributed TensorFlow training jobs require managing multiple sets of TensorFlow replicas. Each set of replicas usually has a different role in the job. For example, one set acts as parameter servers, another provides workers and another provides a controller.
K8s makes it easy to configure and deploy each set of TF replicas. Various tools like helm and ksonnet can be used to simplify generating the configs for a TF job.
However, in addition to generating the configs we need some custom control logic because K8s built-in controllers (Jobs, ReplicaSets, StatefulSets, etc...) don't provide the semantics needed for managing TF jobs.
To solve this we define a K8S Third Party Resource and Operator to manage a TensorFlow job on K8s.
TfJob provides a K8s resource representing a single, distributed, TensorFlow job. The Spec and Status (defined in tf_job.go) are customized for TensorFlow. The spec allows specifying the Docker image and arguments to use for each TensorFlow replica (i.e. master, worker, and parameter server). The status provides relevant information such as the number of replicas in various states.
Using a TPR gives users the ability to create and manage TF Jobs just like builtin K8s resources. For example to create a job
kubectl create -f examples/tf_job.yaml
To list jobs
kubectl get tfjobs
NAME KINDS
example-job TfJob.v1beta1.mlkube.io
The code is closely modeled on Coreos's etcd-operator.
The TfJob Spec(defined in tf_job.go) reuses the existing Kubernetes structure PodTemplateSpec to describe TensorFlow processes. We use PodTemplateSpec because we want to make it easy for users to configure the processes; for example setting resource requirements or adding volumes. We expect helm or ksonnet could be used to add syntactic sugar to create more convenient APIs for users not familiar with Kubernetes.
Leader election allows a K8s deployment resource to be used to upgrade the operator.
-
Clone the repository
git clone https://cloud-ml.googlesource.com/jlewi-training
-
Deploy the operator
kubectl create -f ./images/tf_operator/tf_job_operator_deployment.yaml
-
Make sure the operator is running
kubectl get pods NAME READY STATUS RESTARTS AGE tf-job-operator-3083500267-wxj43 1/1 Running 0 48m
A simplistic TF program is in the directory tf_sample.
-
Start the example
kubectl create -f examples/tf_job.yaml
-
Check the job
kubectl get tfjobs -o yaml
This is very much a prototype.
Logging still needs work.
We'd like to tag log entries with semantic information suitable for TensorFlow. For example, we'd like to tag entries with metadata indicating the replica that produced the log entry. There are two issues here
-
Tagging Tensorflow entries with appropriate semantic information
- Usinge Python sitecustomize.py might facilitate injecting a custom log handler that outputs json entries.
- For parameter servers, we might want to just run the TensorFlow standard server and its not clear how we would convert those logs to json.
-
Integrate with Kubernetes cluster level logging.
- We'd like the logs to integrate nicely with whatever cluster level logging users configure.
- For example, on GCP we'd like the log entries to be automatically streamed to Stackdriver and indexed by the TensorFlow metadata to facilitate querying e.g. by replica.
- GCP's fluentd logger is supposed to automatically handle JSON logs
Additionally, we'd like TensorFlow logs to be available via
kubectl logs
So that users don't need to depend on cluster level logging just to see basic logs.
In the current implementation, pods aren't deleted until the TfJob is deleted. This allows standard out/error to be fetched via kubectl. Unfortunately, this leaves PODs in the RUNNING state when the TfJob is marked as done which is confusing.
The status information reported by the operator is hacky and not well thought out. In particular, we probably need to figure out what the proper phases and conditions to report are.
The semantics for aggregating status of individual replicas into overall TfJob status needs to be thought out.
There is a lot of code from earlier versions (including the ETCD operator) that still needs to be cleaned up.
Tests are out of date and broken.
To build the agent pull in client-go
```
go get k8s.io/client-go/...
```
Create a symbolic link inside your GOPATH to the location you checked out the code
```
ln -sf ${GIT_TRAINING}/src/mlkube.io ${GOPATH}/src/
```
- GIT_TRAINING should be the location where you checked out https://cloud-ml.googlesource.com/training/
Build it
go install mlkube.io/cmd/tf_operator
Running the operator locally (as opposed to deploying it on a K8s cluster) is convenient for debugging/development.
We can configure the operator to run locally using the configuration available in your kubeconfig to communicate with a K8s cluster.
Set your environment
export USE_KUBE_CONFIG=$(echo ~/.kube/config)
export MY_POD_NAMESPACE=default
export MY_POD_NAME=my-pod
TODO(jlewi): Do we still need to set MY_POD_NAME? Why?
On ubuntu the default go package appears to be gccgo-go which has problems see issue golang-go package is also really old so install from golang tarballs instead.