
azure-databricks-operator's Introduction

Azure Databricks operator (for Kubernetes)


This project is experimental. Expect the API to change. It is not recommended for production environments.

Introduction

Kubernetes offers the facility of extending its API through the concept of Operators. This repository contains the resources and code to deploy an Azure Databricks Operator for Kubernetes.

The Databricks operator is useful in situations where Kubernetes hosted applications wish to launch and use Databricks data engineering and machine learning tasks.

Key benefits of using Azure Databricks operator

  1. Easy to use: Azure Databricks operations can be performed with kubectl; there is no need to learn or install the Databricks CLI and its Python dependency.

  2. Security: there is no need to distribute the Databricks token to users; the token is used only by the operator.

  3. Version control: all the YAML or Helm charts that describe Azure Databricks operations (clusters, jobs, …) can be tracked.

  4. Automation: replicate Azure Databricks operations on any Databricks workspace by applying the same manifests or Helm charts.


The project was built using

  1. Kubebuilder
  2. Golang SDK for Databricks

How to use Azure Databricks operator

  1. Download the latest release manifests:
wget https://github.com/microsoft/azure-databricks-operator/releases/latest/download/release.zip
unzip release.zip
  2. Create the azure-databricks-operator-system namespace:
kubectl create namespace azure-databricks-operator-system
  3. Create Kubernetes secrets with values for DATABRICKS_HOST and DATABRICKS_TOKEN:
kubectl --namespace azure-databricks-operator-system \
    create secret generic dbrickssettings \
    --from-literal=DatabricksHost="https://xxxx.azuredatabricks.net" \
    --from-literal=DatabricksToken="xxxxx"
  4. Apply the manifests for the Operator and CRDs in release/config:
kubectl apply -f release/config

For detailed deployment guides, please see deploy.md

Samples

  1. Create a Spark cluster on demand and run a Databricks notebook.


  2. Create an interactive Spark cluster and run a Databricks job on an existing cluster.


  3. Create an Azure Databricks secret scope from Kubernetes secrets.


For samples and simple use cases on how to use the operator, please see samples.md

Quick start

One-click start using VS Code


For more details please see contributing.md

Roadmap

Check roadmap.md for what has been supported and what's coming.

Resources

A few topics are discussed in resources.md:

  • Dev container
  • Build pipelines
  • Operator metrics
  • Kubernetes on WSL

Contributing

For instructions about setting up your environment to develop and extend the operator, please see contributing.md

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.


azure-databricks-operator's Issues

Crash when submitting djob and run simultaneously with nil pointer dereference

If you submit a Djob and a Run simultaneously, the operator crashes with a nil pointer dereference error. Submitting them simultaneously is a use case that we unfortunately cannot avoid.

The desired behaviour would be similar to the handling of secretscopes when the underlying secret does not exist: rather than crashing, report an error in the logs and continue on the next reconcile cycle (see the sketch after the stack trace below).

To reproduce, given the following Run referencing a Djob:

apiVersion: databricks.microsoft.com/v1alpha1
kind: Djob
metadata:
  name: device-pessl
  namespace: dx
spec:
  new_cluster:
    spark_version: 5.3.x-scala2.11
    spark_conf:
      spark.databricks.delta.preview.enabled: "true"
    node_type_id: Standard_DS3_v2
    spark_env_vars:
      PYSPARK_PYTHON: '/databricks/python3/bin/python3'
    num_workers: 1
  notebook_task:
    notebook_path: "/Shared/notebooks/stream_builder-2.24.0"
  max_retries: 3
 
---
apiVersion: databricks.microsoft.com/v1alpha1
kind: Run
metadata:
  name: device-pessl-run
  namespace: dx
spec:
  job_name: device-pessl
  notebook_params:
    job_name: device-pessl

run:
kubectl apply -f job_and_run.yaml

output:

2019-12-17T10:03:27.791+1100	INFO	controllers.Djob	Starting reconcile loop for dx/device-pessl
2019-12-17T10:03:27.791+1100	INFO	controllers.Djob	Submit for dx/device-pessl
2019-12-17T10:03:27.791+1100	INFO	controllers.Djob	Submitting job device-pessl
2019-12-17T10:03:27.821+1100	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "run", "request": "dx/device-pessl-run"}
2019-12-17T10:03:27.822+1100	INFO	controllers.Run	Submitting run device-pessl-run
2019-12-17T10:03:27.821+1100	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Run","namespace":"dx","name":"device-pessl-run","uid":"d94ff46e-6f39-4759-a6ed-3a18525fbdeb","apiVersion":"databricks.microsoft.com/v1alpha1","resourceVersion":"56723"}, "reason": "Added", "message": "Object finalizer is added"}
E1217 10:03:27.822531   47051 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 357 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x201b080, 0x3062de0)
	/Users/d886442/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/Users/d886442/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x82
panic(0x201b080, 0x3062de0)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/microsoft/azure-databricks-operator/controllers.(*RunReconciler).submit(0xc000290240, 0xc000278000, 0x1, 0x21ce575)
	/Users/d886442/projects/data-exchange/azure-databricks-operator/controllers/run_controller_databricks.go:71 +0x530
github.com/microsoft/azure-databricks-operator/controllers.(*RunReconciler).Reconcile(0xc000290240, 0xc0000cacfa, 0x2, 0xc0000cace0, 0x10, 0x307c760, 0x0, 0xc000054540, 0xc000105de8)
	/Users/d886442/projects/data-exchange/azure-databricks-operator/controllers/run_controller.go:80 +0x272
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001980c0, 0x20687a0, 0xc0001ceea0, 0x2068700)
	/Users/d886442/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:256 +0x146
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001980c0, 0xc0005b2100)
	/Users/d886442/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:232 +0xb5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0001980c0)
	/Users/d886442/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0005f5330)
	/Users/d886442/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0005f5330, 0x3b9aca00, 0x0, 0x1, 0xc0000ae300)
	/Users/d886442/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc0005f5330, 0x3b9aca00, 0xc0000ae300)
	/Users/d886442/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/Users/d886442/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:193 +0x326
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1e93220]

goroutine 357 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/Users/d886442/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0x105
panic(0x201b080, 0x3062de0)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/microsoft/azure-databricks-operator/controllers.(*RunReconciler).submit(0xc000290240, 0xc000278000, 0x1, 0x21ce575)
	/Users/d886442/projects/data-exchange/azure-databricks-operator/controllers/run_controller_databricks.go:71 +0x530
github.com/microsoft/azure-databricks-operator/controllers.(*RunReconciler).Reconcile(0xc000290240, 0xc0000cacfa, 0x2, 0xc0000cace0, 0x10, 0x307c760, 0x0, 0xc000054540, 0xc000105de8)
	/Users/d886442/projects/data-exchange/azure-databricks-operator/controllers/run_controller.go:80 +0x272
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001980c0, 0x20687a0, 0xc0001ceea0, 0x2068700)
	/Users/d886442/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:256 +0x146
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001980c0, 0xc0005b2100)
	/Users/d886442/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:232 +0xb5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0001980c0)
	/Users/d886442/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0005f5330)
	/Users/d886442/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0005f5330, 0x3b9aca00, 0x0, 0x1, 0xc0000ae300)
	/Users/d886442/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc0005f5330, 0x3b9aca00, 0xc0000ae300)
	/Users/d886442/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/Users/d886442/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:193
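
A sketch of the desired behaviour described above: guard the Run submit path instead of letting it panic. The type and field names below are hypothetical stand-ins, not the operator's actual code.

// Hypothetical guard in RunReconciler.submit: if the referenced Djob has not
// been submitted to Databricks yet, return an error so the reconcile is
// retried, instead of dereferencing a nil status.
func (r *RunReconciler) submit(instance *databricksv1alpha1.Run) error {
    if instance.Spec.JobName != "" {
        var job databricksv1alpha1.Djob
        key := types.NamespacedName{Namespace: instance.Namespace, Name: instance.Spec.JobName}
        if err := r.Get(context.Background(), key, &job); err != nil {
            return fmt.Errorf("failed to get referenced Djob %s: %w", instance.Spec.JobName, err)
        }
        if job.Status == nil || job.Status.JobStatus == nil {
            // The Djob exists in k8s but has no Databricks job ID yet.
            return fmt.Errorf("djob %s has not been submitted yet, requeueing run", instance.Spec.JobName)
        }
    }
    // ... existing submit logic ...
    return nil
}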

Delays in reconciliation under load

(Pre-emptive shout-out to @EliiseS, @storey247 and @lawrencegripper as this work has been a group effort)
As mentioned in #131 we have been performing some load tests against the operator. Our initial load run shows raised work-queue latency and an increasing work-queue depth.


It's worth noting that the histogram buckets for the latency are 0.1s, 1s, 10s, so a value of 10 on the graph in effect means somewhere between 1s and 10s.

Looking at the metrics for the mock API that we're using for the load tests, the response times look pretty constant:


What we can see in the mock API metrics are periods of time where no requests are being made to the API (and these become more pronounced as the test load ramps up).

Looking at this, our hypothesis was that there is something causing the reconciliation loops to block.

Unable to delete secretscope in k8s if we fail to submit ACLs when creating it

When using the operator to create a secretscope with some secrets and ACLs, if we fail to set the ACLs for some reason, we won't be able to delete the secretscope from k8s. We will need to go to Databricks to delete the secretscope.

In my particular case, submitting the ACLs fails because my Databricks workspace doesn't support ACLs, as it's not a Premium SKU. This is the error I get in the operator:

DEBUG controller-runtime.manager.events Warning {"object": {"kind":"SecretScope","namespace":"kubeflow","name":"test-secretscope","uid":"864a1362-218c-11ea-b7de-4a5fce98e05a","apiVersion":"databricks.microsoft.com/v1alpha1","resourceVersion":"11666051"}, "reason": "Failed", "message": "Failed to submit object: Response from server (403) {\"error_code\":\"PERMISSION_DENIED\",\"message\":\"ACL is not supported in your workspace.\"}"}

The problem can be seen in https://github.com/microsoft/azure-databricks-operator/blob/master/controllers/secretscope_controller_databricks.go. If r.submitACLs(instance) fails we just return (L210) and never set instance.Status.SecretScope properly (L220). But we need instance.Status.SecretScope to be able to delete the secretscope from k8s (L226).

There are at least two possible behaviors here:

  1. If we fail to set the ACLs, we roll back everything we did so we don't end up with a secretscope with invalid security.
  2. We ignore the error but still set instance.Status.SecretScope so we can at least delete it from k8s (see the sketch below).
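
A rough sketch of option 2 in the secret scope submit path. The type and field names are assumptions based on the status shape shown elsewhere on this page, not the actual code.

// Record the scope in Status as soon as it exists in Databricks, before
// attempting ACLs, so the finalizer can still delete it if submitACLs fails.
instance.Status = &databricksv1alpha1.SecretScopeStatus{
    SecretScope: &dbmodels.SecretScope{Name: scopeName, BackendType: "DATABRICKS"},
}
if err := r.Status().Update(context.Background(), instance); err != nil {
    return err
}
if err := r.submitACLs(instance); err != nil {
    // The scope is now tracked in Status, so deletion from k8s keeps working;
    // surface the ACL failure for the next reconcile (or record it as an event).
    return fmt.Errorf("failed to submit ACLs: %w", err)
}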

Modeling Databricks API with Custom Resource Definitions

Megathread

Current state

NotebookJobs are the first (and currently only) custom resource definition (CRD) in this project. The Databricks API is quite rich, and cannot be modeled completely by this one object.

Desired state

A set of CRDs that can be used to model many or all operations that can be performed on the Databricks API.

Modeling the Databricks API

This thread is for discussion on which CRDs could be introduced in order to model the Databricks API 2.0.

For reference, the Databricks API is documented here.

Cluster controller contains logic that can potentially delete a cluster in use

Looking at the code for creation of clusters using the Databricks operator, there seems to be logic that could potentially delete a cluster: https://github.com/microsoft/azure-databricks-operator/blob/master/controllers/dcluster_controller_databricks.go#L32

This seems incorrect, although the business logic in the main controller appears to prevent this code from ever being hit, since !IsSubmitted is the only way it can be reached.

Suggest removing the code that deletes the cluster, as it seems extremely dangerous. The last thing we would want is for a cluster to be deleted while it is running jobs.

Add cluster name support for dbricks run

Currently, you can submit a run on an existing cluster by providing existing_cluster_id:

apiVersion: databricks.microsoft.com/v1alpha1
kind: Run
metadata:
  name: drun-twitteringest1
spec:
  # create a run directly without a job
  existing_cluster_id: 1021-013622-bused793

You should also be able to submit a run on an existing cluster by providing existing_cluster_name (a resolution sketch follows the manifest):

apiVersion: databricks.microsoft.com/v1alpha1
kind: Run
metadata:
  name: drun-twitteringest1
spec:
  # create a run directly without a job
  existing_cluster_name: dcluster-interactive2
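
One way this could be implemented is to resolve the name to an ID before submitting the run. The clusterInfo type and clusterLister interface below are hypothetical stand-ins for the Databricks SDK client, sketched for illustration only.

package databricks

import "fmt"

// clusterInfo and clusterLister are illustrative stand-ins for the SDK types.
type clusterInfo struct {
    ClusterID   string
    ClusterName string
}

type clusterLister interface {
    List() ([]clusterInfo, error)
}

// resolveClusterID maps an existing_cluster_name onto an existing_cluster_id
// by listing the workspace clusters and matching on name.
func resolveClusterID(lister clusterLister, name string) (string, error) {
    clusters, err := lister.List()
    if err != nil {
        return "", err
    }
    for _, c := range clusters {
        if c.ClusterName == name {
            return c.ClusterID, nil
        }
    }
    return "", fmt.Errorf("no cluster found with name %q", name)
}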

Investigate to resolve "error": "error when refreshing cluster: unexpected end of JSON input"

Investigate the error below; it pollutes our logs. It doesn't break the operator, and the expected request executes successfully. A possible defensive fix is sketched after the log.

2019-10-15T02:09:42.255Z    ERROR    controller-runtime.controller    Reconciler error    {"controller": "dcluster", "request": "dx/interactive-always-on-nopassthrough", "error": "error when refreshing cluster: unexpected end of JSON input"}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
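
The message "unexpected end of JSON input" is what encoding/json returns when asked to unmarshal an empty body, so one likely cause is the cluster refresh call occasionally returning no payload. A defensive sketch follows; the function name is illustrative, not the operator's actual code.

package databricks

import (
    "encoding/json"
    "fmt"
)

// parseClusterInfo guards against an empty response body before unmarshalling,
// turning the cryptic "unexpected end of JSON input" into a clearer error.
func parseClusterInfo(body []byte, out interface{}) error {
    if len(body) == 0 {
        return fmt.Errorf("empty response body from Databricks while refreshing cluster")
    }
    if err := json.Unmarshal(body, out); err != nil {
        return fmt.Errorf("error when refreshing cluster: %w", err)
    }
    return nil
}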

Add events for errors

After a quick glance at the controllers, it seems that they are adding events for resources on successful steps, which is great. For troubleshooting it would also be helpful to output events when there are error conditions (see the sketch below).
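
A minimal sketch of what that could look like in a reconciler, assuming the reconcilers keep (or gain) a record.EventRecorder as they already do for the success events shown elsewhere on this page:

// On failure, emit a Warning event next to the existing Normal events so the
// error shows up in `kubectl describe` as well as in the logs.
if err := r.submit(instance); err != nil {
    r.Recorder.Event(instance, corev1.EventTypeWarning, "Failed",
        fmt.Sprintf("Failed to submit object: %s", err.Error()))
    return ctrl.Result{}, err
}
r.Recorder.Event(instance, corev1.EventTypeNormal, "Submitted", "Object is submitted")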

Consolidate Python API into native Golang operator code

Currently, we have Python and Golang as 2 separate applications and therefore 2 containers. There is an overhead in maintaining 2 sets of development environments and in the communication between them via Swagger.

Work is in progress to re-implement the Python logic in Golang and eventually make everything one container.

Load testing the Operator - early output

With the work on #104 to add metrics into Prometheus, we've been able to combine this with a Locust load test, run some early tests against a mocked Databricks API, and graph the k8s API server, Locust and databricks-operator metrics to see how things scale.

It's very early days, but I'd be interested to know how much of this work you'd like to see contributed back to the project.


Shout out to the team for their work here @stuartleeks @EliiseS @martinpeck and @storey247

Delete SecretScope Api object doesn't delete SecretScope in databricks

If you delete the SecretScope API object, https://australiaeast.azuredatabricks.net/api/2.0/secrets/scopes/list still shows that SecretScope.

If you call https://australiaeast.azuredatabricks.net/api/2.0/secrets/list?scope=dsecretscope-twitters it returns an empty JSON object {} instead of:

{
    "error_code": "RESOURCE_DOES_NOT_EXIST",
    "message": "Scope xxxx does not exist!"
} 

Error on secretscope delete/update

Hi,

I've been trying to delete and update a SecretScope and ran into issues.
First, update:
RBAC is missing a permission to patch the existing events:

'events "***" is forbidden: User "system:serviceaccount:azure-databricks-operator-system:default" cannot patch resource "events" in API group "" in the namespace "****"' (will not retry!)

Second, delete:
After a kubectl delete secretscope there is a cascade of events and the operator pod goes into a CrashLoop.
Only deleting the pod brings it back (plus some manual cleanup).
To get rid of the secretscope in Kubernetes I have to delete the finalizer.

2019-10-01T01:07:41.687Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "secretscope", "request": "***/*********", "error": "error when submitting secret scope to the API: Response from server (400) {\"error_code\":\"RESOURCE_ALREADY_EXISTS\",\"message\":\"Scope ***** already exists!\"}"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
2019-10-01T01:07:42.687Z	INFO	controllers.SecretScope	Finish reconcile loop for dx/lbr-alarm-95986
E1001 01:07:42.687910       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:76
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:65
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:522
/usr/local/go/src/runtime/panic.go:82
/usr/local/go/src/runtime/signal_unix.go:390
/workspace/controllers/secretscope_controller_databricks.go:193
/workspace/controllers/secretscope_controller_finalizer.go:37
/workspace/controllers/secretscope_controller.go:66
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1337
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1155031]

goroutine 320 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:58 +0x105
panic(0x12cf380, 0x215a8a0)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/microsoft/azure-databricks-operator/controllers.(*SecretScopeReconciler).delete(0xc000222240, 0xc006ba5680, 0x1, 0x1470e27)
	/workspace/controllers/secretscope_controller_databricks.go:193 +0x41
github.com/microsoft/azure-databricks-operator/controllers.(*SecretScopeReconciler).handleFinalizer(0xc000222240, 0xc006ba5680, 0xc00041a120, 0xc00626b170)
	/workspace/controllers/secretscope_controller_finalizer.go:37 +0x95
github.com/microsoft/azure-databricks-operator/controllers.(*SecretScopeReconciler).Reconcile(0xc000222240, 0xc00349455a, 0x2, 0xc003494540, 0xf, 0x216e900, 0x0, 0x0, 0x0)
	/workspace/controllers/secretscope_controller.go:66 +0x7af
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000206000, 0x131a4e0, 0xc004bbca40, 0x131a400)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216 +0x149
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000206000, 0xc000add300)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192 +0xb5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc000206000)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0005e90a0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0005e90a0, 0x3b9aca00, 0x0, 0xc000000001, 0xc0004ae9c0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc0005e90a0, 0x3b9aca00, 0xc0004ae9c0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:157 +0x311

Errors creating cluster

We see these errors reconciling clusters:

2019-11-24T23:19:58.371Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "dcluster", "request": "dx/interactive-always-on-nopassthrough", "error": "error when refreshing cluster: unexpected end of JSON input"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88

Here is the YAML:

apiVersion: databricks.microsoft.com/v1alpha1
kind: Dcluster
spec:
  autotermination_minutes: 120
  cluster_name: interactive-always-on-nopassthrough
  custom_tags:
  - key: ResourceClass
    value: Serverless
  driver_node_type_id: Standard_D3_v2
  enable_elastic_disk: true
  init_scripts:
  - dbfs:
      destination: dbfs:/databricks/init_scripts/openssl_fix.sh
  - dbfs:
      destination: dbfs:/databricks/init_scripts/log_analytics/listeners.sh
  node_type_id: Standard_D3_v2
  num_workers: 5
  spark_conf:
    spark.databricks.cluster.profile: serverless
    spark.databricks.delta.preview.enabled: "true"
    spark.databricks.repl.allowedLanguages: sql,python
    spark.hadoop.hive.server2.enable.doAs: "false"
  spark_env_vars:
    PYSPARK_PYTHON: /databricks/python3/bin/python3
  spark_version: latest-stable-scala2.11
status:
  cluster_info:
    cluster_cores: "0"
    cluster_id: 1113-043420-odors235

generating random string consolidation

There's a helpers.go under api/alphav1 with funcs for generating random string characters, but there's also randomStringWithCharset in controllers/suit_test.go. It might be good to pull these out and put helpers.go in a folder accessible to all tests (see the sketch below).
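
A sketch of what a shared helper package could look like; the package path is hypothetical, and the function mirrors the common randomStringWithCharset pattern already used in the tests.

package testhelpers

import (
    "math/rand"
    "time"
)

const lowercaseAlphanumeric = "abcdefghijklmnopqrstuvwxyz0123456789"

var seededRand = rand.New(rand.NewSource(time.Now().UnixNano()))

// RandomStringWithCharset returns a random string of the given length drawn
// from charset, useful for generating unique resource names in tests.
func RandomStringWithCharset(length int, charset string) string {
    b := make([]byte, length)
    for i := range b {
        b[i] = charset[seededRand.Intn(len(charset))]
    }
    return string(b)
}

// RandomString returns a random lowercase alphanumeric string.
func RandomString(length int) string {
    return RandomStringWithCharset(length, lowercaseAlphanumeric)
}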

Add databricks golang sdk Mock

Currently, we have integration tests; as part of our tests we call the Databricks API and create resources.

We should be able to isolate the k8s controllers and Group API types and test them without calling the Databricks API (see the sketch below).
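
A minimal sketch of the idea: put a small interface between the controllers and the Databricks SDK so tests can inject a fake. The interface shown here is illustrative and much narrower than the real SDK surface.

// JobsAPI is the slice of the Databricks SDK the controllers actually need.
type JobsAPI interface {
    Create(jobSettings interface{}) (int64, error)
    Delete(jobID int64) error
}

// fakeJobsAPI records calls instead of hitting the Databricks API, so
// controller tests can run without a workspace.
type fakeJobsAPI struct {
    created []interface{}
    deleted []int64
}

func (f *fakeJobsAPI) Create(jobSettings interface{}) (int64, error) {
    f.created = append(f.created, jobSettings)
    return int64(len(f.created)), nil
}

func (f *fakeJobsAPI) Delete(jobID int64) error {
    f.deleted = append(f.deleted, jobID)
    return nil
}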

memory leak - oomkilled regularly

Even with a 1Gi memory limit, the manager container gets OOMKilled regularly:

manager:
    Container ID:  docker://62d43076d560f831bd01c41766444aa7d4795f5433b45ac88d8f1fcad2c423ef
    Image:         mcr.microsoft.com/k8s/azure-databricks/operator:7bb0a68096d6e32f78ebffca6c5c3f5e507eff8e
    Image ID:      docker-pullable://mcr.microsoft.com/k8s/azure-databricks/operator@sha256:dc10d4b0f23d9077ca2a55a65d1b1655a8da4750a5dc49da7b32d90533c033fa
    Port:          <none>
    Host Port:     <none>
    Command:
      /manager
    Args:
      --metrics-addr=127.0.0.1:8080
    State:          Running
      Started:      Thu, 17 Oct 2019 13:00:12 +1100
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 17 Oct 2019 09:53:25 +1100
      Finished:     Thu, 17 Oct 2019 13:00:11 +1100
    Ready:          True
    Restart Count:  15
    Limits:
      cpu:     500m
      memory:  1Gi
    Requests:
      cpu:     200m
      memory:  512Mi

Run sits in invalid state without Status or RunId if call to RunsGetOutput fails

There are several lifecycle states for a run: https://docs.databricks.com/dev-tools/api/latest/jobs.html#runlifecyclestate.

Terminal states are TERMINATED, SKIPPED and INTERNAL_ERROR. The other states are PENDING, RUNNING and TERMINATING. But in certain circumstances I see that a Run (which I can see inside Databricks using the web UI) is still showing as State: <blank> and RunId: <blank>.

Now, the issue:
This condition works most of the time, but sometimes it doesn't. When the issue happens, the logs show the following error:

2019-11-22T10:03:39.781Z        INFO    controllers.Run Refreshing run test-run
2019-11-22T10:03:49.781Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "run", "request": "kubeflow/test-run", "error": "error when refreshing run: Get https://westeurope.azuredatabricks.net/api/2.0/jobs/runs/get-output?run_id=18: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88

So I guess there is a bug in the operator: it is not setting life_cycle_state to a valid value when it gets that exception, and the step that creates the run is not updating the K8s status correctly in the error state.

Notebook doesn't know the name of the secret scope

The operator creates Databricks secret scopes named after instance.ObjectMeta.Name with "_scope" appended.

However our notebooks don't know the value of instance.ObjectMeta.Name and hence cannot construct the secret scope name.

Creating K8s SecretScope object may fail if the Databricks SecretScope exists

Setting up a SecretScope may fail even if the Kubernetes secret referenced in it exists.
An empty scope will be created in Databricks, but with no content.
On Kubernetes the SecretScope has no status:

status: {}
As opposed to:

status:
  secretscope:
    backend_type: DATABRICKS
    name: lbr-device-95986
The operator will then try to set it up again, but since there is already an empty scope in the remote Databricks workspace it will fail, every time.

2019-10-16T03:35:06.786Z    ERROR    controller-runtime.controller    Reconciler error    {"controller": "secretscope", "request": "dx/lbr-alarm-95986", "error": "error when submitting secret scope to the API: Response from server (400) {\"error_code\":\"RESOURCE_ALREADY_EXISTS\",\"message\":\"Scope lbr-alarm-95986 already exists!\"}"}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88

I don't have logs of the initial failure.

I tried to just delete the scope in Databricks but somehow nothing happened; the operator didn't try to set it up again.

In that scenario, we have to manually delete the Run, Djob and SecretScope and reinstall them.
Another issue comes up when we try to delete a SecretScope that was unsuccessful: the operator crashes.
We need to manually delete the SecretScope (edit it and remove the finalizer).

2019-10-16T03:44:02.406Z    INFO    controllers.SecretScope    Starting reconcile loop for dx/lbr-alarm-95986
2019-10-16T03:44:02.406Z    INFO    controllers.SecretScope    Finish reconcile loop for dx/lbr-alarm-95986
E1016 03:44:02.406550       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:76
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:65
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:522
/usr/local/go/src/runtime/panic.go:82
/usr/local/go/src/runtime/signal_unix.go:390
/workspace/controllers/secretscope_controller_databricks.go:193
/workspace/controllers/secretscope_controller_finalizer.go:37
/workspace/controllers/secretscope_controller.go:66
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1337
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x11554d1]
goroutine 307 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:58 +0x105
panic(0x12cf5c0, 0x215a8a0)
    /usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/microsoft/azure-databricks-operator/controllers.(*SecretScopeReconciler).delete(0xc0002ea360, 0xc00311a000, 0x1, 0x1471067)
    /workspace/controllers/secretscope_controller_databricks.go:193 +0x41
github.com/microsoft/azure-databricks-operator/controllers.(*SecretScopeReconciler).handleFinalizer(0xc0002ea360, 0xc00311a000, 0xc00000dba0, 0xc0022fc750)
    /workspace/controllers/secretscope_controller_finalizer.go:37 +0x95
github.com/microsoft/azure-databricks-operator/controllers.(*SecretScopeReconciler).Reconcile(0xc0002ea360, 0xc0015d822a, 0x2, 0xc0015d8200, 0xf, 0x216e900, 0x0, 0x0, 0x0)
    /workspace/controllers/secretscope_controller.go:66 +0x7af
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000378be0, 0x131a720, 0xc001efef60, 0x131a700)
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216 +0x149
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000378be0, 0xc000276a00)
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192 +0xb5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc000378be0)
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00008bee0)
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00008bee0, 0x3b9aca00, 0x0, 0x1, 0xc0000881e0)
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc00008bee0, 0x3b9aca00, 0xc0000881e0)
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:157 +0x311
rpc error: code = Unknown desc = Error: No such container: 0bc65f23c711e08b3f871e3081aba9a0f10f7398e8ca37d5af233b41a1527a9f

dcluster doesn't show the STATE and NUMWORKERS

How to replicate:
Use the YAML file below.

---
apiVersion: databricks.microsoft.com/v1alpha1
kind: Dcluster
metadata:
  name: interactive-cluster-1
spec:
  spark_version: latest-stable-scala2.11
  node_type_id: Standard_D3_v2
  autoscale:
    min_workers: 1
    max_workers: 6
  driver_node_type_id: Standard_D3_v2
  custom_tags:
  - key: Tag
    value: CustomTag1
  spark_env_vars:
    PYSPARK_PYTHON: /databricks/python3/bin/python3
  enable_elastic_disk: true

Then run kubectl apply -f docs/samples/2_job_rub/interactive-cluster1.yaml, followed by kubectl get dcluster:

kubectl get dcluster
NAME                    AGE    CLUSTERID             STATE   NUMWORKERS
interactive-cluster-2   110m   1018-033048-slew898

Applying ACL on non Premium Databricks

If you're running a Databricks instance which is not on the Premium tier, ACLs are not available.

Regardless of whether your config has acls set or not, the operator will still try to list all ACLs. Listing ACLs will return Error: {"error_code":"PERMISSION_DENIED","message":"ACL is not supported in your workspace."} if you are not on the Premium tier.

If ACLs are not available, the config will fail and be put back onto the reconcile loop. It will try to create the secret scope again and, because it already exists, fail and be put back on the loop.

Instead what should happen is:

  • If acls is not set in the config, don't call submitACLs.
  • If acls is set in the config and the workspace is not on the Premium tier, don't put the job back on the reconcile loop; log an event instead (see the sketch below).
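
A sketch of that logic in the secret scope reconcile path; the spec field name and the string-based error check are assumptions for illustration only.

// Only submit ACLs when the spec actually defines some, and treat a
// PERMISSION_DENIED response (non-Premium workspace) as terminal rather than
// requeueing forever.
if len(instance.Spec.SecretScopeACLs) > 0 {
    if err := r.submitACLs(instance); err != nil {
        if strings.Contains(err.Error(), "PERMISSION_DENIED") {
            r.Recorder.Event(instance, corev1.EventTypeWarning, "ACLNotSupported", err.Error())
            return nil // do not requeue
        }
        return err
    }
}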

Run CRD reports incorrect state information when Run fails in DataBricks

  • Create a job & run using sample databricks_v1alpha1_djob.yaml
  • The job will not work as the jar file specified in the spec is invalid.
  • Try and create a Run object for this job using the databricks_v1alpha1_run_job.yaml
  • The run will fail because the jar file is invalid, but the Operator throws an exception and never updates its state.
  • Log shows the following error:
2019-11-12T14:36:56.612Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "run", "request": "default/run-sample", "error": "error when refreshing run: Run result unavailable: job failed with error message\n Library installation failed for library jar: \"dbfs:/my-jar.jar\"\n. Error messages:\njava.lang.Throwable: java.io.FileNotFoundException: dbfs:/my-jar.jar"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
  • This error then proceeds to keep recurring in the logs

Expected behaviour:

  • No error reported
  • Correct status reported

Randomise Names in Tests

To prevent tests from failing because of leftover, undeleted deployments in the cluster, it would be good if we could make the names of our deployments random (see the sketch below).
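
A small sketch of how a test could do this, assuming a random-string helper like the one discussed in the consolidation issue above; the type alias and helper name are illustrative.

// Give each test object a random suffix so leftovers from a previous run
// cannot collide with the next one.
jobName := fmt.Sprintf("t-djob-%s", randomStringWithCharset(8, "abcdefghijklmnopqrstuvwxyz0123456789"))
instance := &databricksv1alpha1.Djob{
    ObjectMeta: metav1.ObjectMeta{
        Name:      jobName,
        Namespace: "default",
    },
}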

r.Update(context.Background(), instance) throws exception for newly submitted job

Reconciler error {"controller": "notebookjob", "request": "default/sample1run16", "error": "error when refreshing job to API: error when updating NotebookJob: Operation cannot be fulfilled on notebookjobs.databricks.microsoft.com "sample1run16": the object has been modified; please apply your changes to the latest version and try again"}
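
This is the standard optimistic-concurrency conflict from the API server. One common way to handle it is to wrap the update in retry.RetryOnConflict from k8s.io/client-go/util/retry; a sketch follows, with the refreshed status field being hypothetical.

// Re-read the latest version of the object and retry the update on conflict,
// instead of surfacing "the object has been modified" as a reconcile error.
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
    if err := r.Get(context.Background(), req.NamespacedName, instance); err != nil {
        return err
    }
    instance.Status.Run = run // hypothetical field being refreshed
    return r.Update(context.Background(), instance)
})
if err != nil {
    return ctrl.Result{}, err
}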

Controllers do not report upstream dependency metrics

The operator calls a number of upstream dependencies via the Databricks SDK. Unfortunately, at present there is no way of finding out how many calls happen or how long they take.

Thinking about this from a performance standpoint, it would be nice to have instrumentation on the hot path of the code for commonly used entities such as job, cluster and run, with a framework that allows easy extension to the other operations.

Metrics should most likely be exposed via the standard K8s metrics tooling, Prometheus (see the sketch below).
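
A sketch of such a framework: a histogram registered with controller-runtime's Prometheus registry and a small wrapper around each SDK call. The metric and label names are made up for illustration.

package instrumentation

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var databricksRequestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "databricks_request_duration_seconds",
        Help:    "Duration of calls to the Databricks API, by entity and action.",
        Buckets: prometheus.DefBuckets,
    },
    []string{"entity", "action"},
)

func init() {
    // controller-runtime serves everything in this registry on /metrics.
    ctrlmetrics.Registry.MustRegister(databricksRequestDuration)
}

// TrackDuration wraps a Databricks SDK call and records how long it took.
func TrackDuration(entity, action string, call func() error) error {
    start := time.Now()
    err := call()
    databricksRequestDuration.WithLabelValues(entity, action).Observe(time.Since(start).Seconds())
    return err
}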

Failed to create Dcluster object

I've just installed v0.30 and attempted to create a DCluster using the config/samples yaml.

Kubernetes version 1.13.10

I get the following error in the operator logs:

2019-10-14T17:55:03.104Z        INFO    controllers.Dcluster    Starting reconcile loop for kubeflow/dcluster-sample
2019-10-14T17:55:03.104Z        INFO    controllers.Dcluster    AddFinalizer for kubeflow/dcluster-sample
2019-10-14T17:55:03.128Z        INFO    controllers.Dcluster    Finish reconcile loop for kubeflow/dcluster-sample
2019-10-14T17:55:03.128Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "dcluster", "request": "kubeflow/dcluster-sample"}
2019-10-14T17:55:03.128Z        INFO    controllers.Dcluster    Starting reconcile loop for kubeflow/dcluster-sample
2019-10-14T17:55:03.128Z        INFO    controllers.Dcluster    Submit for kubeflow/dcluster-sample
2019-10-14T17:55:03.128Z        INFO    controllers.Dcluster    Create cluster dcluster-sample
2019-10-14T17:55:03.128Z        DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"Dcluster","namespace":"kubeflow","name":"dcluster-sample","uid":"bf7ba102-eeab-11e9-a0ba-1e18e514b3df","apiVersion":"databricks.microsoft.com/v1alpha1","resourceVersion":"8427"}, "reason": "Added", "message": "Object finalizer is added"}
2019-10-14T17:55:10.006Z        INFO    controllers.Dcluster    Finish reconcile loop for kubeflow/dcluster-sample
2019-10-14T17:55:10.006Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "dcluster", "request": "kubeflow/dcluster-sample"}
2019-10-14T17:55:10.006Z        INFO    controllers.Dcluster    Starting reconcile loop for kubeflow/dcluster-sample
2019-10-14T17:55:10.007Z        INFO    controllers.Dcluster    Refresh for kubeflow/dcluster-sample
2019-10-14T17:55:10.007Z        INFO    controllers.Dcluster    Refresh cluster dcluster-sample
2019-10-14T17:55:10.007Z        DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"Dcluster","namespace":"kubeflow","name":"dcluster-sample","uid":"bf7ba102-eeab-11e9-a0ba-1e18e514b3df","apiVersion":"databricks.microsoft.com/v1alpha1","resourceVersion":"8443"}, "reason": "Submitted", "message": "Object is submitted"}
2019-10-14T17:55:10.704Z        INFO    controllers.Dcluster    Finish reconcile loop for kubeflow/dcluster-sample
2019-10-14T17:55:10.704Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "dcluster", "request": "kubeflow/dcluster-sample", "error": "error when refreshing cluster: unexpected end of JSON input"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
$ k get dclusters.databricks.microsoft.com
NAME              AGE   CLUSTERID              STATE   NUMWORKERS
dcluster-sample   2m    1014-175509-erred163

$ k describe dclusters.databricks.microsoft.com  dcluster-sample
Name:         dcluster-sample
Namespace:    kubeflow
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"databricks.microsoft.com/v1alpha1","kind":"Dcluster","metadata":{"annotations":{},"name":"dcluster-sample","namespace":"kub...
API Version:  databricks.microsoft.com/v1alpha1
Kind:         Dcluster
Metadata:
  Creation Timestamp:  2019-10-14T17:55:03Z
  Finalizers:
    dcluster.finalizers.databricks.microsoft.com
  Generation:        2
  Resource Version:  8443
  Self Link:         /apis/databricks.microsoft.com/v1alpha1/namespaces/kubeflow/dclusters/dcluster-sample
  UID:               bf7ba102-eeab-11e9-a0ba-1e18e514b3df
Spec:
  Autoscale:
    max_workers:  5
    min_workers:  2
  cluster_name:   dcluster-sample
  node_type_id:   Standard_D3_v2
  spark_version:  5.3.x-scala2.11
Status:
  cluster_info:
    cluster_cores:  0
    cluster_id:     1014-175509-erred163
Events:
  Type    Reason     Age    From                 Message
  ----    ------     ----   ----                 -------
  Normal  Added      3m40s  dcluster-controller  Object finalizer is added
  Normal  Submitted  3m33s  dcluster-controller  Object is submitted
