
Databricks SDK for Python (Beta)

Home Page: https://databricks-sdk-py.readthedocs.io/

License: Apache License 2.0


databricks-sdk-py's Introduction

Databricks SDK for Python (Beta)


Beta: This SDK is supported for production use cases, but we do expect future releases to have some interface changes; see Interface stability. We are keen to hear feedback from you on these SDKs. Please file issues, and we will address them. See also the SDK for Java, the SDK for Go, the Terraform Provider, the cloud-specific docs (AWS, Azure, GCP), and the API reference on readthedocs.

The Databricks SDK for Python includes functionality to accelerate development with Python for the Databricks Lakehouse. It covers all public Databricks REST API operations. The SDK's internal HTTP client is robust and handles failures on different levels by performing intelligent retries.

Getting started

  1. Please install Databricks SDK for Python via pip install databricks-sdk and instantiate WorkspaceClient:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
for c in w.clusters.list():
    print(c.cluster_name)

Databricks SDK for Python is compatible with Python 3.7 (until June 2023), 3.8, 3.9, 3.10, and 3.11.
Note: Databricks Runtime starting from version 13.1 includes a bundled version of the Python SDK.
It is highly recommended to upgrade to the latest version which you can do by running the following in a notebook cell:

%pip install --upgrade databricks-sdk

followed by

dbutils.library.restartPython()

Code examples

The Databricks SDK for Python comes with a number of examples demonstrating how to use the library for various common use cases.

These examples, and more, are located in the examples/ directory of the GitHub repository.

Authentication

If you use Databricks configuration profiles or Databricks-specific environment variables for Databricks authentication, the only code required to start working with a Databricks workspace is the following code snippet, which instructs the Databricks SDK for Python to use its default authentication flow:

from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
w. # press <TAB> for autocompletion

The conventional name for the variable that holds the workspace-level client of the Databricks SDK for Python is w, which is shorthand for workspace.

In this section

Default authentication flow

If you run the Databricks Terraform Provider, the Databricks SDK for Go, the Databricks CLI, or applications that target the Databricks SDKs for other languages, most likely they will all interoperate nicely together. By default, the Databricks SDK for Python tries the following authentication methods, in the following order, until it succeeds:

  1. Databricks native authentication
  2. Azure native authentication
  3. If the SDK is unsuccessful at this point, it returns an authentication error and stops running.

You can instruct the Databricks SDK for Python to use a specific authentication method by setting the auth_type argument as described in the following sections.

For each authentication method, the SDK searches for compatible authentication credentials in the following locations, in the following order. Once the SDK finds a compatible set of credentials that it can use, it stops searching:

  1. Credentials that are hard-coded into configuration arguments.

    ⚠️ Caution: Databricks does not recommend hard-coding credentials into arguments, as they can be exposed in plain text in version control systems. Use environment variables or configuration profiles instead.

  2. Credentials in Databricks-specific environment variables.

  3. For Databricks native authentication, credentials in the .databrickscfg file's DEFAULT configuration profile from its default file location (~ for Linux or macOS, and %USERPROFILE% for Windows); a sample profile is shown after this list.

  4. For Azure native authentication, the SDK searches for credentials through the Azure CLI as needed.
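
For illustration, here is a minimal ~/.databrickscfg with a DEFAULT profile for Databricks token authentication (the host and token values are placeholders):

[DEFAULT]
host  = https://<my-workspace>.cloud.databricks.com
token = <personal-access-token>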

Depending on the Databricks authentication method, the SDK uses the following information. Presented are the WorkspaceClient and AccountClient arguments (which have corresponding .databrickscfg file fields), their descriptions, and any corresponding environment variables.

Databricks native authentication

By default, the Databricks SDK for Python initially tries Databricks token authentication (auth_type='pat' argument). If the SDK is unsuccessful, it then tries Databricks basic (username/password) authentication (auth_type="basic" argument).

  • For Databricks token authentication, you must provide host and token; or their environment variable or .databrickscfg file field equivalents.
  • For Databricks basic authentication, you must provide host, username, and password (for AWS workspace-level operations); or host, account_id, username, and password (for AWS, Azure, or GCP account-level operations); or their environment variable or .databrickscfg file field equivalents.

  • host (String): The Databricks host URL for either the Databricks workspace endpoint or the Databricks accounts endpoint. Environment variable: DATABRICKS_HOST
  • account_id (String): The Databricks account ID for the Databricks accounts endpoint. Only has effect when host is either https://accounts.cloud.databricks.com/ (AWS), https://accounts.azuredatabricks.net/ (Azure), or https://accounts.gcp.databricks.com/ (GCP). Environment variable: DATABRICKS_ACCOUNT_ID
  • token (String): The Databricks personal access token (PAT) (AWS, Azure, and GCP) or Azure Active Directory (Azure AD) token (Azure). Environment variable: DATABRICKS_TOKEN
  • username (String): The Databricks username part of basic authentication. Only possible when host is *.cloud.databricks.com (AWS). Environment variable: DATABRICKS_USERNAME
  • password (String): The Databricks password part of basic authentication. Only possible when host is *.cloud.databricks.com (AWS). Environment variable: DATABRICKS_PASSWORD

For example, to use Databricks token authentication:

from databricks.sdk import WorkspaceClient
w = WorkspaceClient(host=input('Databricks Workspace URL: '), token=input('Token: '))

Azure native authentication

By default, the Databricks SDK for Python first tries Azure client secret authentication (auth_type='azure-client-secret' argument). If the SDK is unsuccessful, it then tries Azure CLI authentication (auth_type='azure-cli' argument). See Manage service principals.

The Databricks SDK for Python picks up an Azure CLI token, if you've previously authenticated as an Azure user by running az login on your machine. See Get Azure AD tokens for users by using the Azure CLI.
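
For example, a minimal sketch of forcing Azure CLI authentication after running az login (the host value is a placeholder):

from databricks.sdk import WorkspaceClient

# relies on a prior `az login` on this machine
w = WorkspaceClient(host='https://adb-1234567890123456.7.azuredatabricks.net',
                    auth_type='azure-cli')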

To authenticate as an Azure Active Directory (Azure AD) service principal, you must provide one of the following. See also Add a service principal to your Azure Databricks account:

  • azure_workspace_resource_id, azure_client_secret, azure_client_id, and azure_tenant_id; or their environment variable or .databrickscfg file field equivalents.
  • azure_workspace_resource_id and azure_use_msi; or their environment variable or .databrickscfg file field equivalents.

  • azure_workspace_resource_id (String): The Azure Resource Manager ID for the Azure Databricks workspace, which is exchanged for a Databricks host URL. Environment variable: DATABRICKS_AZURE_RESOURCE_ID
  • azure_use_msi (Boolean): true to use the Azure Managed Service Identity passwordless authentication flow for service principals. This feature is not yet implemented in the Databricks SDK for Python. Environment variable: ARM_USE_MSI
  • azure_client_secret (String): The Azure AD service principal's client secret. Environment variable: ARM_CLIENT_SECRET
  • azure_client_id (String): The Azure AD service principal's application ID. Environment variable: ARM_CLIENT_ID
  • azure_tenant_id (String): The Azure AD service principal's tenant ID. Environment variable: ARM_TENANT_ID
  • azure_environment (String): The Azure environment type (such as Public, UsGov, China, and Germany) for a specific set of API endpoints. Defaults to PUBLIC. Environment variable: ARM_ENVIRONMENT

For example, to use Azure client secret authentication:

from databricks.sdk import WorkspaceClient
w = WorkspaceClient(host=input('Databricks Workspace URL: '),
                    azure_workspace_resource_id=input('Azure Resource ID: '),
                    azure_tenant_id=input('AAD Tenant ID: '),
                    azure_client_id=input('AAD Client ID: '),
                    azure_client_secret=input('AAD Client Secret: '))

Please see more examples in this document.

Google Cloud Platform native authentication

By default, the Databricks SDK for Python first tries GCP credentials authentication (auth_type='google-credentials' argument). If the SDK is unsuccessful, it then tries Google Cloud Platform (GCP) ID authentication (auth_type='google-id' argument).

The Databricks SDK for Python picks up an OAuth token in the scope of the Google Default Application Credentials (DAC) flow. This means that if you have run gcloud auth application-default login on your development machine, or run the application on compute that is allowed to impersonate the Google Cloud service account specified in google_service_account, authentication should work out of the box. See Creating and managing service accounts.

To authenticate as a Google Cloud service account, you must provide one of the following:

  • host and google_credentials; or their environment variable or .databrickscfg file field equivalents.
  • host and google_service_account; or their environment variable or .databrickscfg file field equivalents.

  • google_credentials (String): GCP Service Account Credentials JSON, or the location of these credentials on the local filesystem. Environment variable: GOOGLE_CREDENTIALS
  • google_service_account (String): The Google Cloud Platform (GCP) service account e-mail used for impersonation in the Default Application Credentials flow, which does not require a password. Environment variable: DATABRICKS_GOOGLE_SERVICE_ACCOUNT

For example, to use Google ID authentication:

from databricks.sdk import WorkspaceClient
w = WorkspaceClient(host=input('Databricks Workspace URL: '),
                    google_service_account=input('Google Service Account: '))

Overriding .databrickscfg

For Databricks native authentication, you can override the default behavior for using .databrickscfg as follows:

  • profile (String): A connection profile specified within .databrickscfg to use instead of DEFAULT. Environment variable: DATABRICKS_CONFIG_PROFILE
  • config_file (String): A non-default location of the Databricks CLI credentials file. Environment variable: DATABRICKS_CONFIG_FILE

For example, to use a profile named MYPROFILE instead of DEFAULT:

from databricks.sdk import WorkspaceClient
w = WorkspaceClient(profile='MYPROFILE')
# Now call the Databricks workspace APIs as desired...

Additional authentication configuration options

For all authentication methods, you can override the default behavior in client arguments as follows:

  • auth_type (String): When multiple auth attributes are available in the environment, use the auth type specified by this argument. This argument also holds the currently selected auth. Environment variable: DATABRICKS_AUTH_TYPE
  • http_timeout_seconds (Integer): Number of seconds for the HTTP timeout. Default is 60. Environment variable: (none)
  • retry_timeout_seconds (Integer): Number of seconds to keep retrying HTTP requests. Default is 300 (5 minutes). Environment variable: (none)
  • debug_truncate_bytes (Integer): Truncate JSON fields in debug logs above this limit. Default is 96. Environment variable: DATABRICKS_DEBUG_TRUNCATE_BYTES
  • debug_headers (Boolean): true to debug HTTP headers of requests made by the application. Default is false, as headers contain sensitive data, such as access tokens. Environment variable: DATABRICKS_DEBUG_HEADERS
  • rate_limit (Integer): Maximum number of requests per second made to the Databricks REST API. Environment variable: DATABRICKS_RATE_LIMIT

For example, to turn on debug HTTP headers:

from databricks.sdk import WorkspaceClient
w = WorkspaceClient(debug_headers=True)
# Now call the Databricks workspace APIs as desired...

Long-running operations

When you invoke a long-running operation, the SDK provides a high-level API to trigger the operation and wait for the related entities to reach the correct state, or to return the error message in case of failure. All long-running operations return a generic Wait instance with a result() method that returns the result of the operation once it has finished. The Databricks SDK for Python picks a reasonable default timeout for every method, but sometimes you may want to pass a datetime.timedelta() as the timeout argument to the result() method.

There are a number of long-running operations in Databricks APIs such as managing:

  • Clusters,
  • Command execution
  • Jobs
  • Libraries
  • Delta Live Tables pipelines
  • Databricks SQL warehouses.

For example, in the Clusters API, once you create a cluster, you receive a cluster ID, and the cluster is in the PENDING state. Meanwhile, Databricks takes care of provisioning virtual machines from the cloud provider in the background. The cluster is only usable in the RUNNING state, so you have to wait for that state to be reached.

Another example is the API for running a job or repairing a run: right after the run starts, the run is in the PENDING state. The job is only considered to be finished when it is in either the TERMINATED or SKIPPED state. You would also likely want the error message if the long-running operation times out and fails with an error code. At other times you may want to configure a custom timeout other than the default of 20 minutes.
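
As an illustration, here is a minimal sketch of overriding the timeout and handling a timed-out wait. It assumes that jobs.run_now returns such a Wait object and that result() raises a plain TimeoutError when the deadline passes; the job_id value is a placeholder:

import datetime
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# job_id is a placeholder for an existing job in your workspace
waiter = w.jobs.run_now(job_id=123)
try:
    run = waiter.result(timeout=datetime.timedelta(minutes=30))
    print(f'run finished: {run.run_page_url}')
except TimeoutError as e:
    # the waiter gives up once the deadline passes
    print(f'gave up waiting: {e}')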

In the following example, w.clusters.create_and_wait returns ClusterInfo only once the cluster reaches the RUNNING state; otherwise, it times out after 10 minutes:

import datetime
import logging
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
info = w.clusters.create_and_wait(cluster_name='Created cluster',
                                  spark_version='12.0.x-scala2.12',
                                  node_type_id='m5d.large',
                                  autotermination_minutes=10,
                                  num_workers=1,
                                  timeout=datetime.timedelta(minutes=10))
logging.info(f'Created: {info}')

Please look at examples/starting_job_and_waiting.py for more advanced usage:

import datetime
import logging
import time

from databricks.sdk import WorkspaceClient
import databricks.sdk.service.jobs as j

w = WorkspaceClient()

# create a dummy file on DBFS that just sleeps for 10 seconds
py_on_dbfs = f'/home/{w.current_user.me().user_name}/sample.py'
with w.dbfs.open(py_on_dbfs, write=True, overwrite=True) as f:
    f.write(b'import time; time.sleep(10); print("Hello, World!")')

# trigger one-time-run job and get waiter object
waiter = w.jobs.submit(run_name=f'py-sdk-run-{time.time()}', tasks=[
    j.RunSubmitTaskSettings(
        task_key='hello_world',
        new_cluster=j.BaseClusterInfo(
            spark_version=w.clusters.select_spark_version(long_term_support=True),
            node_type_id=w.clusters.select_node_type(local_disk=True),
            num_workers=1
        ),
        spark_python_task=j.SparkPythonTask(
            python_file=f'dbfs:{py_on_dbfs}'
        ),
    )
])

logging.info(f'starting to poll: {waiter.run_id}')

# callback, that receives a polled entity between state updates
def print_status(run: j.Run):
    statuses = [f'{t.task_key}: {t.state.life_cycle_state}' for t in run.tasks]
    logging.info(f'workflow intermediate status: {", ".join(statuses)}')

# If you want to perform polling in a separate thread, process, or service,
# you can use w.jobs.wait_get_run_job_terminated_or_skipped(
#   run_id=waiter.run_id,
#   timeout=datetime.timedelta(minutes=15),
#   callback=print_status) to achieve the same results.
#
# Waiter interface allows for `w.jobs.submit(..).result()` simplicity in
# the scenarios, where you need to block the calling thread for the job to finish.
run = waiter.result(timeout=datetime.timedelta(minutes=15),
                    callback=print_status)

logging.info(f'job finished: {run.run_page_url}')

Paginated responses

On the platform side, the Databricks APIs deal with pagination in different ways:

  • Some APIs follow the offset-plus-limit pagination
  • Some start their offsets from 0 and some from 1
  • Some use the cursor-based iteration
  • Others just return all results in a single response

The Databricks SDK for Python hides this complexity behind an Iterator[T] abstraction that yields items across pages. Python typing helps auto-complete the individual item fields.

import logging
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
for repo in w.repos.list():
    logging.info(f'Found repo: {repo.path}')

Please look at examples/last_job_runs.py for more advanced usage:

import logging
from collections import defaultdict
from datetime import datetime, timezone
from databricks.sdk import WorkspaceClient

latest_state = {}
all_jobs = {}
durations = defaultdict(list)

w = WorkspaceClient()
for job in w.jobs.list():
    all_jobs[job.job_id] = job
    for run in w.jobs.list_runs(job_id=job.job_id, expand_tasks=False):
        durations[job.job_id].append(run.run_duration)
        if job.job_id not in latest_state:
            latest_state[job.job_id] = run
            continue
        if run.end_time < latest_state[job.job_id].end_time:
            continue
        latest_state[job.job_id] = run

summary = []
for job_id, run in latest_state.items():
    summary.append({
        'job_name': all_jobs[job_id].settings.name,
        'last_status': run.state.result_state,
        'last_finished': datetime.fromtimestamp(run.end_time/1000, timezone.utc),
        'average_duration': sum(durations[job_id]) / len(durations[job_id])
    })

for line in sorted(summary, key=lambda s: s['last_finished'], reverse=True):
    logging.info(f'Latest: {line}')

Single-Sign-On (SSO) with OAuth

Authorization Code flow with PKCE

For a regular web app running on a server, it's recommended to use the Authorization Code Flow to obtain an Access Token and a Refresh Token. This method is considered safe because the Access Token is transmitted directly to the server hosting the app, without passing through the user's web browser and risking exposure.

To enhance the security of the Authorization Code Flow, the PKCE (Proof Key for Code Exchange) mechanism can be employed. With PKCE, the calling application generates a secret called the Code Verifier, which is verified by the authorization server. The app also creates a transformed value of the Code Verifier, called the Code Challenge, and sends it over HTTPS to obtain an Authorization Code. Even if a malicious attacker intercepts the Authorization Code, they cannot exchange it for a token without possessing the Code Verifier.

The sample below is a Python 3 script that uses the Flask web framework along with the Databricks SDK for Python to demonstrate how to implement the OAuth Authorization Code flow with PKCE security. It can be used to build an app where each user uses their own identity to access Databricks resources. The script can be executed with or without client and secret credentials for a custom OAuth app.

The Databricks SDK for Python exposes the oauth_client.initiate_consent() helper to acquire the user redirect URL and initiate PKCE state verification. Application developers are expected to persist the resulting credentials in the webapp session and restore them via the SessionCredentials.from_dict(oauth_client, session['creds']) helper, as shown in the sample below.

Works for both AWS and Azure. Not supported for GCP at the moment.

from databricks.sdk.oauth import OAuthClient

oauth_client = OAuthClient(host='<workspace-url>',
                           client_id='<oauth client ID>',
                           redirect_url=f'http://host.domain/callback',
                           scopes=['clusters'])

import secrets
from flask import Flask, render_template_string, request, redirect, url_for, session

APP_NAME = 'flask-demo'
app = Flask(APP_NAME)
app.secret_key = secrets.token_urlsafe(32)


@app.route('/callback')
def callback():
   from databricks.sdk.oauth import Consent
   consent = Consent.from_dict(oauth_client, session['consent'])
   session['creds'] = consent.exchange_callback_parameters(request.args).as_dict()
   return redirect(url_for('index'))


@app.route('/')
def index():
   if 'creds' not in session:
      consent = oauth_client.initiate_consent()
      session['consent'] = consent.as_dict()
      return redirect(consent.auth_url)

   from databricks.sdk import WorkspaceClient
   from databricks.sdk.oauth import SessionCredentials

   credentials_provider = SessionCredentials.from_dict(oauth_client, session['creds'])
   workspace_client = WorkspaceClient(host=oauth_client.host,
                                      product=APP_NAME,
                                      credentials_provider=credentials_provider)

   return render_template_string('...', w=workspace_client)

SSO for local scripts on development machines

For applications that run on developer workstations, the Databricks SDK for Python provides the auth_type='external-browser' option, which opens a browser for the user to go through the SSO flow. Azure support is still in the early experimental stage.

from databricks.sdk import WorkspaceClient

host = input('Enter Databricks host: ')

w = WorkspaceClient(host=host, auth_type='external-browser')
clusters = w.clusters.list()

for cl in clusters:
    print(f' - {cl.cluster_name} is {cl.state}')

Creating custom OAuth applications

To use OAuth with the Databricks SDK for Python, create a custom OAuth app using the account_client.custom_app_integration.create API:

import logging, getpass
from databricks.sdk import AccountClient
account_client = AccountClient(host='https://accounts.cloud.databricks.com',
                               account_id=input('Databricks Account ID: '),
                               username=input('Username: '),
                               password=getpass.getpass('Password: '))

logging.info('Enrolling all published apps...')
account_client.o_auth_enrollment.create(enable_all_published_apps=True)

status = account_client.o_auth_enrollment.get()
logging.info(f'Enrolled all published apps: {status}')

custom_app = account_client.custom_app_integration.create(
    name='awesome-app',
    redirect_urls=[f'https://host.domain/path/to/callback'],
    confidential=True)
logging.info(f'Created new custom app: '
             f'--client_id {custom_app.client_id} '
             f'--client_secret {custom_app.client_secret}')

Error handling

The Databricks SDK for Python provides a robust error-handling mechanism that allows developers to catch and handle API errors. When an error occurs, the SDK will raise an exception that contains information about the error, such as the HTTP status code, error message, and error details. Developers can catch these exceptions and handle them appropriately in their code.

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import ResourceDoesNotExist

w = WorkspaceClient()
try:
    w.clusters.get(cluster_id='1234-5678-9012')
except ResourceDoesNotExist as e:
    print(f'Cluster not found: {e}')

The SDK handles inconsistencies in error responses amongst the different services, providing a consistent interface for developers to work with. Simply catch the appropriate exception type and handle the error as needed. The errors returned by the Databricks API are defined in databricks/sdk/errors/platform.py.
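
When you do not know the specific subclass in advance, you can fall back to the common base class. Here is a minimal sketch, assuming DatabricksError is exported from databricks.sdk.errors in your SDK version:

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import DatabricksError

w = WorkspaceClient()
try:
    w.workspace.get_status('/no/such/path')
except DatabricksError as e:
    # the exception message carries the error details returned by the API
    print(f'API call failed: {e}')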

Logging

The Databricks SDK for Python seamlessly integrates with the standard Logging facility for Python. This allows developers to easily enable and customize logging for their Databricks Python projects. To enable debug logging in your Databricks Python project, you can follow the example below:

import logging, sys
logging.basicConfig(stream=sys.stderr,
                    level=logging.INFO,
                    format='%(asctime)s [%(name)s][%(levelname)s] %(message)s')
logging.getLogger('databricks.sdk').setLevel(logging.DEBUG)

from databricks.sdk import WorkspaceClient
w = WorkspaceClient(debug_truncate_bytes=1024, debug_headers=False)
for cluster in w.clusters.list():
    logging.info(f'Found cluster: {cluster.cluster_name}')

In the code snippet above, the logging module is imported and basicConfig() configures the root logger, while the databricks.sdk logger's level is set to DEBUG. This enables debug-level logging for the SDK; developers can adjust the logging levels as needed to control the verbosity of the output.

The SDK logs all requests and responses to standard error, using the format > for requests and < for responses. In some cases, requests or responses may be truncated due to size considerations. If this occurs, the log message includes the text ... (XXX additional elements) to indicate that the request or response has been truncated. To increase the truncation limits, developers can set the debug_truncate_bytes configuration property or the DATABRICKS_DEBUG_TRUNCATE_BYTES environment variable.

To protect sensitive data, such as authentication tokens, passwords, or any HTTP headers, the SDK will automatically replace these values with **REDACTED** in the log output. Developers can disable this redaction by setting the debug_headers configuration property to True.

2023-03-22 21:19:21,702 [databricks.sdk][DEBUG] GET /api/2.0/clusters/list
< 200 OK
< {
<   "clusters": [
<     {
<       "autotermination_minutes": 60,
<       "cluster_id": "1109-115255-s1w13zjj",
<       "cluster_name": "DEFAULT Test Cluster",
<       ... truncated for brevity
<     },
<     "... (47 additional elements)"
<   ]
< }

Overall, the logging capabilities provided by the Databricks SDK for Python can be a powerful tool for monitoring and troubleshooting your Databricks Python projects. Developers can use the various logging methods and configuration options provided by the SDK to customize the logging output to their specific needs.

Interaction with dbutils

You can use the client-side implementation of dbutils by accessing the dbutils property on the WorkspaceClient. Most of the dbutils.fs operations and dbutils.secrets are implemented natively in Python within the Databricks SDK. Non-SDK implementations still require a Databricks cluster, which you have to specify through the cluster_id configuration attribute or the DATABRICKS_CLUSTER_ID environment variable. Don't worry if the cluster is not running: internally, the Databricks SDK for Python calls w.clusters.ensure_cluster_is_running().

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
dbutils = w.dbutils

files_in_root = dbutils.fs.ls('/')
print(f'number of files in root: {len(files_in_root)}')

Alternatively, you can import dbutils from databricks.sdk.runtime module, but you have to make sure that all configuration is already present in the environment variables:

from databricks.sdk.runtime import dbutils

for secret_scope in dbutils.secrets.listScopes():
    for secret_metadata in dbutils.secrets.list(secret_scope.name):
        print(f'found {secret_metadata.key} secret in {secret_scope.name} scope')

Interface stability

Databricks is actively working on stabilizing the Databricks SDK for Python's interfaces. API clients for all services are generated from specification files that are synchronized from the main platform. You are highly encouraged to pin the exact dependency version and read the changelog where Databricks documents the changes. Databricks may have minor documented backward-incompatible changes, such as renaming some type names to bring more consistency.
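
For example, to pin an exact version in your environment (the version number below is purely illustrative; pin whichever release you have tested):

pip install databricks-sdk==0.1.12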

databricks-sdk-py's Issues

[BUG] in jobs.py: TypeError: Object of type AccessControlRequest is not JSON serializable

In

if self.access_control_list: body['access_control_list'] = [v for v in self.access_control_list]

the AccessControlRequest objects are put into the dict without being converted. This leads to

TypeError: Object of type AccessControlRequest is not JSON serializable

when calling JobsAPI.create.

Instead, the line should be:

if self.access_control_list: body['access_control_list'] = [v.as_dict() for v in self.access_control_list]

[BUG] in jobs.py: TypeError: Object of type Library is not JSON serializable

In

if self.libraries: body['libraries'] = [v for v in self.libraries]

the Library objects are put into the dict without being converted. This leads to

TypeError: Object of type Library is not JSON serializable

when calling JobsAPI.create with a Library object set in a JobTask.

Instead, the line should be:

if self.libraries: body['libraries'] = [v.as_dict() for v in self.libraries]

metastore.delete force does not force

I believe the issue is on the API side, but workspace_client.metastores.delete(id=metastoreId, force=True) will not force-delete the metastore.

I am having to send force as a URL parameter instead: workspace_client.metastores.delete(id=metastoreId+'?force=true', force=True), which adds force=true to the URL and not to the message body.

Please loosen the dependency on `requests`

This package currently depends on requests>=2.28.1,<2.29.0, which is a very tight range and unfortunately has published vulnerabilities. Please widen the allowed range (I suggest requests>=2.28.1,<3 if forward compatibility is important) so that this package does not force insecure dependent packages on its consumers.

Error when creating single node job clusters

When creating a single node job cluster, an error is thrown if the num_workers is set to 0. However, this is the same setting used when inspecting the JSON structure of a single node job cluster.

Error message:
databricks.sdk.core.DatabricksError: Cluster validation error: Missing required field: settings.cluster_spec.new_cluster.size

Cluster configuration:

{
    "job_cluster_key": "ascend_ingest",
    "new_cluster": {
        "spark_version": "12.2.x-scala2.12",
        "spark_conf": {
            "spark.databricks.delta.preview.enabled": "true",
            "spark.master": "local[*, 4]",
            "spark.databricks.cluster.profile": "singleNode",
        },
        "azure_attributes": {
            "first_on_demand": 1,
            "availability": "ON_DEMAND_AZURE",
            "spot_bid_max_price": -1,
        },
        "node_type_id": "Standard_DS3_v2",
        "custom_tags": {"ResourceClass": "SingleNode"},
        "spark_env_vars": {"PYSPARK_PYTHON": "/databricks/python3/bin/python3"},
        "enable_elastic_disk": True,
        "data_security_mode": "SINGLE_USER",
        "runtime_engine": "STANDARD",
        ## Seems there's a bug that won't allow for single node clusters
        "num_workers": 0,
    },
}

Note that changing num_workers to 1 resolves the issue.

dbutils.fs.mount generates wrong proxy call

When calling "dbutils.fs.mount" the proxy call towards the cluster is using the wrong dictionary names for the parameters.
The generated code looks like this:

        import json
        (args, kwargs) = json.loads('[[], {"source": "<someUrl>", "mountPoint": "/mnt/<myMountPoint", "encryptionType": "", "owner": "", "extraConfigs": null}]')
        result = dbutils.fs.mount(*args, **kwargs)
        dbutils.notebook.exit(json.dumps(result))

When this is executed in the cluster, the following error is thrown:
TypeError: DBUtils.FSHandler.mount() got an unexpected keyword argument 'mountPoint'
Furthermore, this error is not proxied back; instead you get TypeError: the JSON object must be str, bytes or bytearray, not NoneType (dbutils.py:245).

The actual problem is the wrong parameter names in the method signature of "mount".
The original Databricks API expects the parameters with underscores instead of camelCase (see https://docs.databricks.com/dbfs/mounts.html):

dbutils.fs.mount(
  source: str,
  mount_point: str,
  encryption_type: Optional[str] = "",
  extra_configs: Optional[dict[str:str]] = None
)

The same probably applies to the other mount methods.

w.alerts.list(): 'datetime-range' is not a valid ParameterType

  1. log into https://e2-demo-field-eng.cloud.databricks.com/?o=1444828305810485
  2. run: w.alerts.list()

I am not sure which alert causes this issue; it would be great if debug=True (or something of this sort) could be passed as a kwarg to enable debugging of the element that caused the exception (or is there a better way to do this from notebooks?).

Exception:

ValueError                                Traceback (most recent call last)
<command-3538949529976490> in <cell line: 1>()
----> 1 w.alerts.list()

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/sql.py in list(self)
   2198 
   2199         json = self._api.do('GET', '/api/2.0/preview/sql/alerts')
-> 2200         return [Alert.from_dict(v) for v in json]
   2201 
   2202     def list_schedules(self, alert_id: str, **kwargs) -> Iterator[RefreshSchedule]:

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/sql.py in <listcomp>(.0)
   2198 
   2199         json = self._api.do('GET', '/api/2.0/preview/sql/alerts')
-> 2200         return [Alert.from_dict(v) for v in json]
   2201 
   2202     def list_schedules(self, alert_id: str, **kwargs) -> Iterator[RefreshSchedule]:

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/sql.py in from_dict(cls, d)
     70                    name=d.get('name', None),
     71                    options=AlertOptions.from_dict(d['options']) if 'options' in d else None,
---> 72                    query=Query.from_dict(d['query']) if 'query' in d else None,
     73                    rearm=d.get('rearm', None),
     74                    state=AlertState(d['state']) if 'state' in d else None,

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/sql.py in from_dict(cls, d)
   1201             latest_query_data_id=d.get('latest_query_data_id', None),
   1202             name=d.get('name', None),
-> 1203             options=QueryOptions.from_dict(d['options']) if 'options' in d else None,
   1204             permission_tier=PermissionLevel(d['permission_tier']) if 'permission_tier' in d else None,
   1205             query=d.get('query', None),

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/sql.py in from_dict(cls, d)
   1454     def from_dict(cls, d: Dict[str, any]) -> 'QueryOptions':
   1455         return cls(moved_to_trash_at=d.get('moved_to_trash_at', None),
-> 1456                    parameters=[Parameter.from_dict(v)
   1457                                for v in d['parameters']] if 'parameters' in d else None)
   1458 

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/sql.py in <listcomp>(.0)
   1454     def from_dict(cls, d: Dict[str, any]) -> 'QueryOptions':
   1455         return cls(moved_to_trash_at=d.get('moved_to_trash_at', None),
-> 1456                    parameters=[Parameter.from_dict(v)
   1457                                for v in d['parameters']] if 'parameters' in d else None)
   1458 

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/sql.py in from_dict(cls, d)
   1101         return cls(name=d.get('name', None),
   1102                    title=d.get('title', None),
-> 1103                    type=ParameterType(d['type']) if 'type' in d else None,
   1104                    value=d.get('value', None))
   1105 

/usr/lib/python3.9/enum.py in __call__(cls, value, names, module, qualname, type, start)
    358         """
    359         if names is None:  # simple value lookup
--> 360             return cls.__new__(cls, value)
    361         # otherwise, functional API: we're creating a new Enum type
    362         return cls._create_(

/usr/lib/python3.9/enum.py in __new__(cls, value)
    676                 ve_exc = ValueError("%r is not a valid %s" % (value, cls.__qualname__))
    677                 if result is None and exc is None:
--> 678                     raise ve_exc
    679                 elif exc is None:
    680                     exc = TypeError(

ValueError: 'datetime-range' is not a valid ParameterType

Error "'LEGACY_SINGLE_USER_STANDARD' is not a valid DataSecurityMode" when trying to list available cluster names

Repro steps:

  1. Install version 0.0.1 of the Databricks SDK for Python.
  2. Run the following code:
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(host = "https://db-sme-demo-docs.cloud.databricks.com", token = "REDACTED")

for c in w.clusters.list():
  print(c.cluster_name)

Expected:

  • A list of available cluster names in my workspace.

Actual:

  • This error:
Traceback (most recent call last):
  File "/Users/paul.cornell/databricks-python-sdk-demo/main.py", line 5, in <module>
    for c in w.clusters.list():
  File "/Users/paul.cornell/.local/share/virtualenvs/paul.cornell-Otax6dmi/lib/python3.9/site-packages/databricks/sdk/service/clusters.py", line 1908, in list
    return [ClusterInfo.from_dict(v) for v in json['clusters']]
  File "/Users/paul.cornell/.local/share/virtualenvs/paul.cornell-Otax6dmi/lib/python3.9/site-packages/databricks/sdk/service/clusters.py", line 1908, in <listcomp>
    return [ClusterInfo.from_dict(v) for v in json['clusters']]
  File "/Users/paul.cornell/.local/share/virtualenvs/paul.cornell-Otax6dmi/lib/python3.9/site-packages/databricks/sdk/service/clusters.py", line 410, in from_dict
    data_security_mode=DataSecurityMode(d['data_security_mode'])
  File "/usr/local/Cellar/[email protected]/3.9.15/Frameworks/Python.framework/Versions/3.9/lib/python3.9/enum.py", line 384, in __call__
    return cls.__new__(cls, value)
  File "/usr/local/Cellar/[email protected]/3.9.15/Frameworks/Python.framework/Versions/3.9/lib/python3.9/enum.py", line 702, in __new__
    raise ve_exc
ValueError: 'LEGACY_SINGLE_USER_STANDARD' is not a valid DataSecurityMode

w.workspace.list('/'): 'MLFLOW_EXPERIMENT' is not a valid ObjectType

When listing files in a workspace folder that contains MLflow experiments (files with a glass bottle icon), an exception is thrown.

how to reproduce:

  1. put mlflow experiment type of file in a workspace root
  2. w.workspace.list('/')

Exception thrown:

ValueError                                Traceback (most recent call last)
<command-3538949529974781> in <cell line: 1>()
----> 1 w.workspace.list('/')

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/workspace.py in list(self, path, notebooks_modified_after, **kwargs)
    283 
    284         json = self._api.do('GET', '/api/2.0/workspace/list', query=query)
--> 285         return [ObjectInfo.from_dict(v) for v in json['objects']]
    286 
    287     def mkdirs(self, path: str, **kwargs):

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/workspace.py in <listcomp>(.0)
    283 
    284         json = self._api.do('GET', '/api/2.0/workspace/list', query=query)
--> 285         return [ObjectInfo.from_dict(v) for v in json['objects']]
    286 
    287     def mkdirs(self, path: str, **kwargs):

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/workspace.py in from_dict(cls, d)
    166                    modified_at=d.get('modified_at', None),
    167                    object_id=d.get('object_id', None),
--> 168                    object_type=ObjectType(d['object_type']) if 'object_type' in d else None,
    169                    path=d.get('path', None),
    170                    size=d.get('size', None))

/usr/lib/python3.9/enum.py in __call__(cls, value, names, module, qualname, type, start)
    358         """
    359         if names is None:  # simple value lookup
--> 360             return cls.__new__(cls, value)
    361         # otherwise, functional API: we're creating a new Enum type
    362         return cls._create_(

/usr/lib/python3.9/enum.py in __new__(cls, value)
    676                 ve_exc = ValueError("%r is not a valid %s" % (value, cls.__qualname__))
    677                 if result is None and exc is None:
--> 678                     raise ve_exc
    679                 elif exc is None:
    680                     exc = TypeError(

ValueError: 'MLFLOW_EXPERIMENT' is not a valid ObjectType

Python import error when using `auth_type="oauth-m2m"`

Import error when using auth_type="oauth-m2m"
When running this python statement:

a = AccountClient(auth_type="oauth-m2m", profile="E2CERTACCT")

I found the error below while running in the python debugger.

 File "/Users/xxxxxx.xxxxxxxx/.pyenv/versions/3.8.12/lib/python3.8/site-packages/databricks/sdk/core.py", line 23, in <module>
    from .azure import ARM_DATABRICKS_RESOURCE_ID, ENVIRONMENTS, AzureEnvironment
ImportError: attempted relative import with no known parent package

log:

2023-06-09 22:53:01,090 [databricks.sdk][INFO] loading E2CERTACCT profile from ~/.databrickscfg: host, account_id, client_id, client_secret
2023-06-09 22:53:01,090 [databricks.sdk][DEBUG] Ignoring pat auth, because oauth-m2m is preferred
2023-06-09 22:53:01,090 [databricks.sdk][DEBUG] Ignoring basic auth, because oauth-m2m is preferred
2023-06-09 22:53:01,090 [databricks.sdk][DEBUG] Ignoring metadata-service auth, because oauth-m2m is preferred
2023-06-09 22:53:01,090 [databricks.sdk][DEBUG] Attempting to configure auth: oauth-m2m
Traceback (most recent call last):
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/site-packages/requests/models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/site-packages/databricks/sdk/core.py", line 415, in __call__
    header_factory = provider(cfg)
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/site-packages/databricks/sdk/core.py", line 60, in wrapper
    return func(cfg)
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/site-packages/databricks/sdk/core.py", line 114, in oauth_service_principal
    token_url=resp.json()["token_endpoint"],
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/site-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/site-packages/databricks/sdk/core.py", line 775, in _init_auth
    self._header_factory = self._credentials_provider(self)
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/site-packages/databricks/sdk/core.py", line 421, in __call__
    raise ValueError(f'{auth_type}: {e}') from e
ValueError: oauth-m2m: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/site-packages/databricks/sdk/core.py", line 496, in __init__
    self._init_auth()
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/site-packages/databricks/sdk/core.py", line 780, in _init_auth
    raise ValueError(f'{self._credentials_provider.auth_type()} auth: {e}') from e
ValueError: default auth: oauth-m2m: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/douglas.moore/development/dba-helper/permissions-graph/extract_account_principals.py", line 44, in <module>
    a = AccountClient(auth_type="oauth-m2m", profile="E2CERTACCT", debug_headers=True, debug_truncate_bytes=300)
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/site-packages/databricks/sdk/__init__.py", line 192, in __init__
    config = client.Config(host=host,
  File "/Users/douglas.moore/.pyenv/versions/3.8.12/lib/python3.8/site-packages/databricks/sdk/core.py", line 501, in __init__
    raise ValueError(message) from e
ValueError: default auth: oauth-m2m: Expecting value: line 1 column 1 (char 0). Config: host=https://accounts.cloud.databricks.com, account_id=deadbeef-deadbeef-deadbeef, client_id=dead999-1234-4321-9999-deadbeef, client_secret=***, profile=E2CERTACCT, auth_type=oauth-m2m, debug_truncate_bytes=300, debug_headers=True

Error using Tables API

Errors received from trying to list tables in the workspace

import os

from databricks.sdk import WorkspaceClient


if __name__ == "__main__":  
    w = WorkspaceClient(
        host="https://2111501043581247.7.gcp.databricks.com/",
        token=os.environ['PAT_JAPAN'])
    w.jobs.delete()

    for t in w.tables.list():
        print(t)
Traceback (most recent call last):
  File "/Users/gant.kuln/test_databricks_py/helloworld.py", line 10, in <module>
    for k in w.tables.list():
  File "/Users/gant.kuln/miniconda3/envs/test_databricks_py/lib/python3.10/site-packages/databricks/sdk/service/unitycatalog.py", line 3081, in list
    json = self._api.do('GET', '/api/2.1/unity-catalog/tables', query=query)
  File "/Users/gant.kuln/miniconda3/envs/test_databricks_py/lib/python3.10/site-packages/databricks/sdk/client.py", line 416, in do
    raise DatabricksError(**response.json())
TypeError: DatabricksError.__init__() got an unexpected keyword argument 'details'
(test_databricks_py) gant.kuln@JDY0V6H3X2 test_databricks_py % 

Likely incomplete isAzure function

The code checks only for .azuredatabricks.net, but I am pretty sure we support other host patterns, some of which are:

".azuredatabricks.net",
".databricks.azure.cn",
".databricks.azure.us",
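
A hypothetical sketch of the broader check the issue is asking for (the helper name is made up for illustration):

AZURE_HOST_SUFFIXES = ('.azuredatabricks.net', '.databricks.azure.cn', '.databricks.azure.us')

def is_azure_host(host: str) -> bool:
    # strip an optional trailing slash before matching the host suffix
    return any(host.rstrip('/').endswith(suffix) for suffix in AZURE_HOST_SUFFIXES)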

Listing networks doesn't work

here is the code:
account = AccountClient(host="https://accounts.cloud.databricks.com",account_id=db_account_id, username=db_username, password=db_password)

account.networks.list()

I used e2-certification account.

here is the error:

ValueError Traceback (most recent call last)
in <cell line: 8>()
6 verifier = Verifier(db_account_id, db_username, db_password, aws_key, aws_secret)
7
----> 8 verifier.run_private_link_check("3324600082051037")

in run_private_link_check(self, workspace_id)
80
81 def run_private_link_check(self, workspace_id:str):
---> 82 db_network_id = self.get_db_network_id(workspace_id)
83 vpc_id = self.get_vpc_id(workspace_id)
84 vpc_endpoints_ids = self.get_vpc_endpoint_ids(network_id)

in get_db_network_id(self, workspace_id)
24 def get_db_network_id(self, workspace_id:str):
25 #get network id.
---> 26 for network in self.account.networks.list():
27 if network.workspace_id == self.workspace_id:
28 return network.network_id

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/deployment.py in list(self)
1292
1293 json = self._api.do('GET', f'/api/2.0/accounts/{self._api.account_id}/networks')
-> 1294 return [Network.from_dict(v) for v in json]
1295
1296

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/deployment.py in <listcomp>(.0)
1292
1293 json = self._api.do('GET', f'/api/2.0/accounts/{self._api.account_id}/networks')
-> 1294 return [Network.from_dict(v) for v in json]
1295
1296

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/deployment.py in from_dict(cls, d)
653 vpc_id=d.get('vpc_id', None),
654 vpc_status=VpcStatus(d['vpc_status']) if 'vpc_status' in d else None,
--> 655 warning_messages=[NetworkWarning.from_dict(v)
656 for v in d['warning_messages']] if 'warning_messages' in d else None,
657 workspace_id=d.get('workspace_id', None))

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/deployment.py in <listcomp>(.0)
653 vpc_id=d.get('vpc_id', None),
654 vpc_status=VpcStatus(d['vpc_status']) if 'vpc_status' in d else None,
--> 655 warning_messages=[NetworkWarning.from_dict(v)
656 for v in d['warning_messages']] if 'warning_messages' in d else None,
657 workspace_id=d.get('workspace_id', None))

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/deployment.py in from_dict(cls, d)
710 def from_dict(cls, d: Dict[str, any]) -> 'NetworkWarning':
711 return cls(warning_message=d.get('warning_message', None),
--> 712 warning_type=WarningType(d['warning_type']) if 'warning_type' in d else None)
713
714

/usr/lib/python3.9/enum.py in __call__(cls, value, names, module, qualname, type, start)
358 """
359 if names is None: # simple value lookup
--> 360 return cls.__new__(cls, value)
361 # otherwise, functional API: we're creating a new Enum type
362 return cls._create_(

/usr/lib/python3.9/enum.py in __new__(cls, value)
676 ve_exc = ValueError("%r is not a valid %s" % (value, cls.__qualname__))
677 if result is None and exc is None:
--> 678 raise ve_exc
679 elif exc is None:
680 exc = TypeError(

ValueError: 'vpc' is not a valid WarningType

Better interface for client credentials

Right now it is quite easy to use OAuthClient. It encapsulates Azure vs AWS differences and it returns a credentials_provider, which is nice. However, ClientCredentials is very crude and does nothing of the above.

One proposal [preferred] would be to encapsulate ClientCredentials within OAuthClient and have a single client to worry about.

An alternative would be to have a ClientCredentialsClient that provides a cloud-agnostic API and returns a credentials_provider.

Enum Generator: allow passing strings for parameters which are enums to make calling more pythonic

Problem

Generated Enum types do not support being instantiated from a string, which forces end users to use the non-pythonic syntax of:

from databricks.sdk.service.workspace import ExportFormat

w.workspace.export('/my_notebook', format=ExportFormat.SOURCE)

instead of neat, short, and pythonic:

w.workspace.export('/my_notebook', format='SOURCE')

which currently throws an exception:

AttributeError                            Traceback (most recent call last)
<command-3538949529976490> in <cell line: 1>()
----> 1 w.workspace.export('/Users/[email protected]/auto-dlt/generated/gr_products/gr_bronze-autoloader', format='SOURCE')

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/databricks/sdk/service/workspace.py in export(self, path, direct_download, format, **kwargs)
    225         query = {}
    226         if direct_download: query['direct_download'] = request.direct_download
--> 227         if format: query['format'] = request.format.value
    228         if path: query['path'] = request.path
    229 

AttributeError: 'str' object has no attribute 'value'

Example

For example:

help(w.workspace.export)

has signature of:

export(path: str, *, direct_download: bool = None, format: databricks.sdk.service.workspace.ExportFormat = None, **kwargs) -> databricks.sdk.service.workspace.ExportResponse method of databricks.sdk.service.workspace.WorkspaceAPI instance

In turn, the format parameter is of the Enum type defined in databricks.sdk.service.workspace.ExportFormat as:

class ExportFormat(Enum):
    """This specifies the format of the file to be imported. By default, this is `SOURCE`. However it
    may be one of: `SOURCE`, `HTML`, `JUPYTER`, `DBC`. The value is case sensitive."""

    DBC = 'DBC'
    HTML = 'HTML'
    JUPYTER = 'JUPYTER'
    R_MARKDOWN = 'R_MARKDOWN'
    SOURCE = 'SOURCE'

Relax requests dependency constraint once psf/requests issue 6432 is resolved

Currently we add a strict upper bound of <2.29 to the requests library. Requests 2.30+ is incompatible with urllib3 <2, but users of databricks-sdk may still depend on older versions of urllib3. Once psf/requests#6432 is resolved, we should relax the upper bound to allow more recent versions of the requests library, which incorporate the most recent release of urllib3. This should improve the security posture of the SDK.

OAuth breaks if a bad .netrc file is present

Description

There is a hidden behaviour in requests where a .netrc will silently override provided authentication headers unless manually overridden.

We first observed this in dbt-databricks because of databricks/dbt-databricks#337. Fixing it, however, required two changes: one to dbt-databricks and one to databricks-sdk-py. The fix is simple: override the default behaviour of requests by supplying a custom AuthBase class (a sketch follows the list below).

  1. The fix to dbt-databricks has to do with Python models which use requests and a REST API (instead of thrift server) to check if the all-purpose cluster is running.
  2. The fix to databricks-sdk-py (this issue) is around OAuth, because the SDK uses requests to perform the OAuth handshake. This step is only needed where OAuth is used to authenticate dbt-databricks
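
For illustration only, here is a minimal sketch of the idea behind the fix: passing any explicit AuthBase instance to requests prevents the silent fallback to ~/.netrc credentials (the class name and placeholder values are hypothetical):

import requests

class BearerAuth(requests.auth.AuthBase):
    # an explicit auth object; its presence stops requests from consulting ~/.netrc
    def __init__(self, token: str):
        self._token = token

    def __call__(self, r):
        r.headers['Authorization'] = f'Bearer {self._token}'
        return r

resp = requests.get('https://<my-workspace>.cloud.databricks.com/api/2.0/clusters/list',
                    auth=BearerAuth('<access-token>'))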

Steps to Reproduce

There is probably an easier way to reproduce this with a short script. But I'm using what I found while developing dbt-databricks.

  1. Checkout this code from dbt-databricks databricks/dbt-databricks#338. This branch incorporates the first fix I described above.

  2. Add an intentionally bad ~/.netrc to your workstation, like this:

machine <my-workspace>.cloud.databricks.com
login token
password <expired_token>
  3. Try to run the Python test_python_uc_sql_endpoint integration test after updating _build_databricks_cluster_target in tests/profiles.py to comment out the "token" key. This forces dbt-databricks to use OAuth instead.
  4. Observe that the test fails with this error:
E           ValueError: b'{"errorCode":"invalid_client","errorSummary":"Invalid value for \'client_id\' parameter.","errorLink":"invalid_client","errorId":"oaeLJQz1r35SrSNVtVcJUig0A","errorCauses":[]}'

This happens because without the override applied, the SDK includes authentication headers for a REST API request that doesn't require authentication and the server kicks back an invalid value for 'client_id' error.

I'm about to open a PR that fixes this.

There is a related issue on databricks-sql-python (which implements its own oauth process). That fix is the same as this one.

Log to stdErr from requests.get

When sending the request to get the token, there is a log line written to stderr that would be great to silence, as it mixes with client output.


[BUG] Always return `None` from `model_registry.get_model()`

When calling API WorkspaceClient.model_registry.get_model(), it always returns GetModelResponse(registered_model=None).

However, the returned value is correct when I call the Web API directly with requests by passing the param.

url = f"{host}/api/2.0/mlflow/databricks/registered-models/get"
resp = requests.get(url, headers=headers, params={"name": "model_name"})

Could you please take a look?

Code completion works in unexpected ways when the Databricks extension for Visual Studio Code is also installed

Repro steps:

  1. Install the Databricks extension for Visual Studio Code and enable globals, which imports a databricks.sdk namespace so that I can access the dbutils global from my Python code.
  2. Install version 0.0.1 of the Python wheel for the Databricks SDK for Python.

Expect:

  • Get IntelliSense (code completion) for both dbutils and the Databricks SDK for Python classes.

Actual:

  1. I get IntelliSense as I type for this: from databricks.sdk.runtime import dbutils. But then databricks.sdk.runtime squiggles with Import "databricks.sdk.runtime" could not be resolved from source Pylance(reportMissingModuleSource).
  2. This doesn't work at all for IntelliSense as I type: from databricks.sdk import WorkspaceClient. However, any code that relies on this import runs as expected.
  3. When I type from databricks.sdk import and then press Ctrl + Enter, I only get a drop-down with runtime. I was expecting a longer list with Databricks SDK for Python classes, for example AccountClient and WorkspaceClient.

PAT auth and config profile auth not recognized

With version 0.0.2 of the SDK installed locally, I have a local config profile named DEFAULT defined in my ~/.databrickscfg file. I have no local DATABRICKS_* environment variables defined.
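For reference, the profile in ~/.databrickscfg looks roughly like this (values redacted):

[DEFAULT]
host  = https://<my-workspace>.cloud.databricks.com
token = <personal-access-token>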

The following code returns the error databricks.sdk.client.DatabricksError: No auth configured:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

for c in w.clusters.list():
  print(c.cluster_name)

So does the following code:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient(profile = 'DEFAULT')

for c in w.clusters.list():
  print(c.cluster_name)

If I set my local DATABRICKS_HOST and DATABRICKS_TOKEN environment variables however, the following code runs as expected with no errors:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

for c in w.clusters.list():
  print(c.cluster_name)

I would expect the SDK to recognize my local DEFAULT profile without my having to specify profile = 'DEFAULT'. Also, I should be able to use my local DEFAULT profile without being forced to set local DATABRICKS_* environment variables if I don't want to.

OAuth Cross-origin error

Getting the error "Cross-origin token redemption is permitted only for the 'Single-Page Application' client-type. Request origin: 'http://localhost:8020/'." when using an OAuth client with Azure.

It works if I remove the code for both u2m and m2m:

        if 'microsoft' in self._client.token_url:
            # Tokens issued for the 'Single-Page Application' client-type may
            # only be redeemed via cross-origin requests
            headers = {'Origin': self._client.redirect_url}

Query History examples missing

The examples and the README are not helpful for using the SDK when it comes to the Query History APIs.

The documentation and user guide need to be improved; otherwise this is not useful, and I have to use the Python requests package and do everything myself from scratch.
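For what it's worth, a minimal sketch of what a Query History call via the SDK might look like (the QueryInfo field names are assumptions):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
# list recent queries from the workspace query history
for q in w.query_history.list(max_results=10):
    print(q.query_id, q.status, q.query_text)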

Error when getting Git credentials

Hi,

I'm trying to use the Python SDK to programmatically create (if not already existing) a set of Git credentials in my Databricks environment. Getting the credentials gives the traceback below. I'm calling the SDK as such:

from databricks.sdk import GitCredentialsAPI
from databricks_cli.sdk.api_client import ApiClient

self.git_creds = GitCredentialsAPI(self.api_client)
self.git_creds.list()

Is the ApiClient you are using by any chance not the same class as the one I'm importing?

Thanks!

Traceback (most recent call last):
File "/home/georgelpreput/Source/pushcart-deploy/.venv/bin/pushcart-deploy", line 6, in
sys.exit(deploy())
^^^^^^^^
File "/home/georgelpreput/Source/pushcart-deploy/.venv/lib/python3.11/site-packages/click/core.py", line 1130, in call
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/georgelpreput/Source/pushcart-deploy/.venv/lib/python3.11/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/home/georgelpreput/Source/pushcart-deploy/.venv/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/georgelpreput/Source/pushcart-deploy/.venv/lib/python3.11/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/georgelpreput/Source/pushcart-deploy/src/pushcart_deploy/setup.py", line 118, in deploy
d.deploy()
File "/home/georgelpreput/Source/pushcart-deploy/src/pushcart_deploy/setup.py", line 69, in deploy
_ = self.repos.get_or_create_git_credentials()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/georgelpreput/Source/pushcart-deploy/src/pushcart_deploy/databricks_api/repos_wrapper.py", line 81, in get_or_create_git_credentials
c for c in self.git_creds.list() if c["git_username"] == git_username
^^^^^^^^^^^^^^^^^^^^^
File "/home/georgelpreput/Source/pushcart-deploy/.venv/lib/python3.11/site-packages/databricks/sdk/service/workspace.py", line 759, in list
json = self._api.do('GET', '/api/2.0/git-credentials')
^^^^^^^^^^^^
AttributeError: 'ApiClient' object has no attribute 'do'
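For comparison, the SDK-native route goes through WorkspaceClient rather than the databricks-cli ApiClient; a minimal sketch (assuming the client exposes git_credentials):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
# list existing Git credentials using the SDK's own client
for cred in w.git_credentials.list():
    print(cred.git_username)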

Missing init_scripts in Cluster Creation?

Hello,

I believe the current method to create a cluster in the SDK is missing init_scripts (which is part of the Create new cluster API). I am currently using a custom class which adds the init_scripts part to the body of the request.

Is adding init_scripts during cluster creation through the SDK planned?

Thank you.

Use the sdk from within databricks?

Hello,

I am looking for the simplest way to automate submitting a large number of jobs. I see that databricks_cli (https://docs.databricks.com/dev-tools/cli/index.html) does not seem to be in active development, so I am looking at this project as a solution for automation.

I would like to be able to run this SDK from within my Databricks workspace, but when I try the example at https://docs.databricks.com/dev-tools/sdk-python.html#get-started-with-the-databricks-sdk-for-python, I get the error message default auth: cannot configure default credentials.

I assume that is because databricks-sdk-py is unaware that I am running from within my databricks workspace. My question is, is there a way to make the module recognize that and assume my credentials?

Create job API call returns TypeError of not JSON serializable

Code:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import JobTaskSettings, NotebookTask, NotebookTaskSource

w = WorkspaceClient()

job_name            = input("Some short name for the job (for example, my-job): ")
description         = input("Some short description for the job (for example, My job): ")
existing_cluster_id = input("ID of the existing cluster in the workspace to run the job on (for example, 1234-567890-ab123cd4): ")
notebook_path       = input("Workspace path of the notebook to run (for example, /Users/[email protected]/my-notebook): ")
task_key            = input("Some key to apply to the job's tasks (for example, my-key): ")

print("Attempting to run the job. Please wait...\n")

j = w.jobs.create(
  name = job_name,
  tasks = [
    JobTaskSettings(
      description = description,
      existing_cluster_id = existing_cluster_id,
      notebook_task = NotebookTask(
        base_parameters = {""},
        notebook_path = notebook_path,
        source = NotebookTaskSource("WORKSPACE")
      ),
      task_key = task_key
    )
  ]
)

print(f"View the job at {w.config.host}/#job/{j.job_id}\n")

Input:

Some short name for the job (for example, my-job): my-job
Some short description for the job (for example, My job): My job
ID of the existing cluster in the workspace to run the job on (for example, 1234-567890-ab123cd4): <CLUSTER-ID-REDACTED>
Workspace path of the notebook to run (for example, /Users/[email protected]/my-notebook): /Users/<FULL-USERNAME-REDACTED>/hello
Some key to apply to the job's tasks (for example, my-key): my-key
Attempting to run the job. Please wait...

Error:

Traceback (most recent call last):
  File "/Users/paul.cornell/databricks-sdk-py-demo/run-job.py", line 14, in <module>
    j = w.jobs.create(
  File "/Users/paul.cornell/databricks-sdk-py-demo/.venv/lib/python3.10/site-packages/databricks/sdk/service/jobs.py", line 2074, in create
    json = self._api.do('POST', '/api/2.1/jobs/create', body=body)
  File "/Users/paul.cornell/databricks-sdk-py-demo/.venv/lib/python3.10/site-packages/databricks/sdk/core.py", line 686, in do
    response = self.request(method, f"{self._cfg.host}{path}", params=query, json=body, headers=headers)
  File "/Users/paul.cornell/databricks-sdk-py-demo/.venv/lib/python3.10/site-packages/requests/sessions.py", line 573, in request
    prep = self.prepare_request(req)
  File "/Users/paul.cornell/databricks-sdk-py-demo/.venv/lib/python3.10/site-packages/requests/sessions.py", line 484, in prepare_request
    p.prepare(
  File "/Users/paul.cornell/databricks-sdk-py-demo/.venv/lib/python3.10/site-packages/requests/models.py", line 371, in prepare
    self.prepare_body(data, files, json)
  File "/Users/paul.cornell/databricks-sdk-py-demo/.venv/lib/python3.10/site-packages/requests/models.py", line 511, in prepare_body
    body = complexjson.dumps(json, allow_nan=False)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type set is not JSON serializable
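The "Object of type set is not JSON serializable" hint suggests the culprit is base_parameters = {""}, which Python parses as a set literal rather than a dict. Reusing the imports and variables from the snippet above, a plain dict would presumably serialize:

notebook_task = NotebookTask(
    base_parameters = {},  # a dict (possibly empty), not the set literal {""}
    notebook_path = notebook_path,
    source = NotebookTaskSource("WORKSPACE")
)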

Workspace Conf fails to get status

Hello,

When using workspace_conf.get_status() on the WorkspaceClient, the method does not work and returns an AttributeError: type object 'dict' has no attribute 'from_dict'. I believe WorkspaceConf is just a dictionary, and thus it is raising an AttributeError. Could you please verify and confirm whether this is correct?

e.g. workspace_client.workspace_conf.get_status(keys="enableTokensConfig") does not work

Thank you.

`AttributeError: 'NoneType' object has no attribute 'debug_truncate_bytes'` when instantiating an `ApiClient` with `cfg = None`

Tried to run the following test:

from databricks.sdk import JobsAPI
# imports below were missing from the original snippet; module paths inferred from the traceback
from databricks.sdk.core import ApiClient
from databricks.sdk.service.jobs import JobCluster, JobTaskSettings

client = ApiClient()
api = JobsAPI(client)
cluster = JobCluster(
    job_cluster_key = "cluster1"
)

task1 = JobTaskSettings(
    task_key = "task1",
)

api.create(
    job_clusters = [cluster],
    tasks = [task1]
)

This causes the following error:

self = <databricks.sdk.core.ApiClient object at 0x1265bcbb0>, cfg = None

    def __init__(self, cfg: Config = None):
        self._cfg = Config() if not cfg else cfg
>       self._debug_truncate_bytes = cfg.debug_truncate_bytes if cfg.debug_truncate_bytes else 96
E       AttributeError: 'NoneType' object has no attribute 'debug_truncate_bytes'

I've fixed the error in a separate branch and will submit a PR. Changes done:

Replaced:

def __init__(self, cfg: Config = None):
    self._cfg = Config() if not cfg else cfg
    self._debug_truncate_bytes = cfg.debug_truncate_bytes if cfg.debug_truncate_bytes else 96

With:

if cfg:
    self._cfg = cfg
    self._debug_truncate_bytes = cfg.debug_truncate_bytes if cfg.debug_truncate_bytes else 96
    self._user_agent_base = cfg.user_agent
else:
    self._cfg = Config()
    self._debug_truncate_bytes = 96
    self._user_agent_base = None

Query History list() does not pass filter_by correctly

Although it's not documented, the query history list endpoint seems to handle filter_by properly only if it is passed in the request body rather than as a query param:

Ran using databricks-sdk v0.1.5, requests v2.28.2:

In [2]: from databricks.sdk import WorkspaceClient
In [3]: client = WorkspaceClient(host=..., token=...)
In [4]: from databricks.sdk.service.sql import QueryFilter
In [6]: filter_by = QueryFilter.from_dict(
   ...:             {
   ...:                 "query_start_time_range": {
   ...:                     "start_time_ms": 0,
   ...:                     "end_time_ms": int(time.time() * 1000),
   ...:                 }
   ...:             }
   ...:         )

In [7]: filter_by
Out[7]: QueryFilter(query_start_time_range=TimeRange(end_time_ms=1683569801679, start_time_ms=0), statuses=None, user_ids=None, warehouse_ids=None)

In [9]: next(client.query_history.list(filter_by=filter_by))
---------------------------------------------------------------------------
DatabricksError                           Traceback (most recent call last)
Cell In[9], line 1
----> 1 next(client.query_history.list(filter_by=filter_by))

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/service/sql.py:2824, in QueryHistoryAPI.list(self, filter_by, include_metrics, max_results, page_token, **kwargs)
   2821 if page_token: query['page_token'] = request.page_token
   2823 while True:
-> 2824     json = self._api.do('GET', '/api/2.0/sql/history/queries', query=query)
   2825     if 'res' not in json or not json['res']:
   2826         return

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/core.py:753, in ApiClient.do(self, method, path, query, body)
    749 if not response.ok:
    750     # TODO: experiment with traceback pruning for better readability
    751     # See https://stackoverflow.com/a/58821552/277035
    752     payload = response.json()
--> 753     raise self._make_nicer_error(status_code=response.status_code, **payload) from None
    754 if not len(response.content):
    755     return {}

DatabricksError: Could not parse request object: Expected 'START_OBJECT' not 'VALUE_STRING'
 at [Source: (ByteArrayInputStream); line: 1, column: 15]
 at [Source: java.io.ByteArrayInputStream@794544b2; line: 1, column: 15]

but if I call:

In [53]: "res" in client.query_history._api.do('GET', '/api/2.0/sql/history/queries', body={"filter_by": filter_by.as_dict()})
Out[53]: True

the API works as expected.

Second, when using pagination with the query history endpoint, it doesn't seem to allow specifying page_token and filter_by at the same time:

In [55]: client.query_history._api.do('GET', '/api/2.0/sql/history/queries', body={"filter_by": filter_by.as_dict(), "page_token": "abc"})
---------------------------------------------------------------------------
DatabricksError                           Traceback (most recent call last)
Cell In[55], line 1
----> 1 client.query_history._api.do('GET', '/api/2.0/sql/history/queries', body={"filter_by": filter_by.as_dict(), "page_token": "abc"})

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/core.py:753, in ApiClient.do(self, method, path, query, body)
    749 if not response.ok:
    750     # TODO: experiment with traceback pruning for better readability
    751     # See https://stackoverflow.com/a/58821552/277035
    752     payload = response.json()
--> 753     raise self._make_nicer_error(status_code=response.status_code, **payload) from None
    754 if not len(response.content):
    755     return {}

DatabricksError: You can provide only one of 'page_token' or 'filter_by'

The current implementation doesn't remove filter_by on subsequent calls and thus doesn't paginate correctly when filter_by is passed in. Here, I patch the do call to pass query params as body, but this time we fail when we attempt to get the second page.

In [56]: original_do = client.api_client.do

In [57]: from unittest.mock import patch

In [62]: def patch_do(method, path, query = None, body = None):
    ...:     print(method,path,query,body)
    ...:     res = original_do(method, path, query = None, body=query)
    ...:     print("RES", res.keys())
    ...:     return res
    ...: 

In [64]: with patch.object(ApiClient, "do") as mock_do:
    ...:     mock_do.side_effect = patch_do
    ...:     it = client.query_history.list(filter_by=filter_by, max_results=1)
    ...:     next(it)
    ...:     next(it)
    ...: 
GET /api/2.0/sql/history/queries {'filter_by': {'query_start_time_range': {'end_time_ms': 1683569801679}}, 'max_results': 1} None
RES dict_keys(['next_page_token', 'has_next_page', 'res'])
GET /api/2.0/sql/history/queries {'filter_by': {'query_start_time_range': {'end_time_ms': 1683569801679}}, 'max_results': 1, 'page_token': 'CkwKJDAxZWRlZGFjLWJjNDktMTgyOS1hM2UwLTYwNDYwZTVkNjU4MBCD2ZHe/zAY09WStKW5tAUiEDcyNTJjYmE1NTlmNDhkZjQo+JgQEgkSBxDP89Hk/zAYAQ=='} None
---------------------------------------------------------------------------
DatabricksError                           Traceback (most recent call last)
Cell In[64], line 5
      3 it = client.query_history.list(filter_by=filter_by, max_results=1)
      4 next(it)
----> 5 next(it)

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/service/sql.py:2824, in QueryHistoryAPI.list(self, filter_by, include_metrics, max_results, page_token, **kwargs)
   2821 if page_token: query['page_token'] = request.page_token
   2823 while True:
-> 2824     json = self._api.do('GET', '/api/2.0/sql/history/queries', query=query)
   2825     if 'res' not in json or not json['res']:
   2826         return

File ~/.pyenv/versions/3.10.9/lib/python3.10/unittest/mock.py:1114, in CallableMixin.__call__(self, *args, **kwargs)
   1112 self._mock_check_sig(*args, **kwargs)
   1113 self._increment_mock_call(*args, **kwargs)
-> 1114 return self._mock_call(*args, **kwargs)

File ~/.pyenv/versions/3.10.9/lib/python3.10/unittest/mock.py:1118, in CallableMixin._mock_call(self, *args, **kwargs)
   1117 def _mock_call(self, /, *args, **kwargs):
-> 1118     return self._execute_mock_call(*args, **kwargs)

File ~/.pyenv/versions/3.10.9/lib/python3.10/unittest/mock.py:1179, in CallableMixin._execute_mock_call(self, *args, **kwargs)
   1177         raise result
   1178 else:
-> 1179     result = effect(*args, **kwargs)
   1181 if result is not DEFAULT:
   1182     return result

Cell In[62], line 3, in patch_do(method, path, query, body)
      1 def patch_do(method, path, query = None, body = None):
      2     print(method,path,query,body)
----> 3     res = original_do(method, path, query = None, body=query)
      4     print("RES", res.keys())
      5     return res

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/databricks/sdk/core.py:753, in ApiClient.do(self, method, path, query, body)
    749 if not response.ok:
    750     # TODO: experiment with traceback pruning for better readability
    751     # See https://stackoverflow.com/a/58821552/277035
    752     payload = response.json()
--> 753     raise self._make_nicer_error(status_code=response.status_code, **payload) from None
    754 if not len(response.content):
    755     return {}

DatabricksError: You can provide only one of 'page_token' or 'filter_by'

To summarize (sorry for the long post!), if my assumptions about the query history API are correct, then:

  1. All params should be passed in the body, even though it's a GET request
  2. The filter_by param should be removed when querying with page_token
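Under those assumptions, a sketch of the paginator behaviour being proposed (not the SDK's actual implementation):

def list_query_history(api, filter_by, max_results=100):
    # first page: send filter_by (and everything else) in the request body
    body = {'filter_by': filter_by.as_dict(), 'max_results': max_results}
    while True:
        json = api.do('GET', '/api/2.0/sql/history/queries', body=body)
        for q in json.get('res', []):
            yield q
        if not json.get('has_next_page'):
            return
        # subsequent pages: the endpoint rejects page_token together with
        # filter_by, so only the token is carried forward
        body = {'max_results': max_results, 'page_token': json['next_page_token']}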

groups parameter for `service_principals.create(...)` is ignored?

Please see the process below, where the service_principals.create function fails to add the specified group. Am I using the wrong datatype?

dbw=WorkspaceClient(...)

groups = {group.display_name:group for group in dbw.groups.list()}
print(groups)
{
 'users': Group(id='...', display_name='users', ...),
 'service_principals': Group(id='...', display_name='service_principals',...),
 'admins': Group(id='...', display_name='admins', ...)
}
sp = dbw.service_principals.create(
    id=secret_client.get_secret("...").value,
    application_id=secret_client.get_secret("...").value,
    display_name="...",
    groups=[
        groups["service_principals"] # this seems to be ignored and it gets added to users instead.
    ]
)
print(sp)
ServicePrincipal(
id='...',
active=True,
application_id='...',
display_name='...',
entitlements=None,
external_id=None,
groups=None, # <-------------------- ???
roles=None
)

the expected value for groups is

groups=[ComplexValue(display='service_principals', primary=None, type='direct', value='...')],
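For reference, a hedged guess at what the call would look like if the API expects ComplexValue entries (built from the group id) rather than whole Group objects:

from databricks.sdk.service.iam import ComplexValue

sp = dbw.service_principals.create(
    display_name="...",
    groups=[ComplexValue(value=groups["service_principals"].id)],
)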

many thanks

Add examples for `grants` and other submodules

I see there are methods in w.grants but they're not documented in the examples path of the repo.

It would also be nice to add a README.md in the examples folder (or add it to the root README) stating that the structure of the SDK mirrors the URLs of the REST API explorer. That would make it easier to discover and navigate the complete functionality of the SDK.
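To help move that along, a tentative sketch of what a grants example might look like (the class and enum names are assumed from the catalog service module):

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import SecurableType, PermissionsChange, Privilege

w = WorkspaceClient()
# read the current grants on a catalog
grants = w.grants.get(securable_type=SecurableType.CATALOG, full_name='main')
# grant SELECT on the catalog to a group
w.grants.update(securable_type=SecurableType.CATALOG, full_name='main',
                changes=[PermissionsChange(principal='data-readers', add=[Privilege.SELECT])])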

Capture resourceType from GET Groups API

The SCIM Groups API returns a value under meta.resourceType which can be either Group (an account-level group) or WorkspaceGroup (a workspace-local group).

Currently, the SDK does not capture this piece of information

@dataclass
class Group:
    display_name: str = None
    entitlements: 'List[ComplexValue]' = None
    external_id: str = None
    groups: 'List[ComplexValue]' = None
    id: str = None
    members: 'List[ComplexValue]' = None
    roles: 'List[ComplexValue]' = None

Note that the value is present when accessing the API directly

[Screenshot: raw API response showing the meta.resourceType field]

I propose we add the meta.resourceType field to the Group data class. This functionality is currently leveraged by the UC-Migration project
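A sketch of what the addition could look like (names are placeholders, not the SDK's actual definitions; ComplexValue comes from the iam module):

from dataclasses import dataclass
from typing import List

@dataclass
class ResourceMeta:
    # 'Group' for account-level groups, 'WorkspaceGroup' for workspace-local groups
    resource_type: str = None

@dataclass
class Group:
    display_name: str = None
    entitlements: 'List[ComplexValue]' = None
    external_id: str = None
    groups: 'List[ComplexValue]' = None
    id: str = None
    members: 'List[ComplexValue]' = None
    meta: 'ResourceMeta' = None  # new field carrying meta.resourceType
    roles: 'List[ComplexValue]' = None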

Note that I'm happy to make an attempt at contributing this functionality. I don't see any contribution guides for this project.

Error while using filter parameter in list operation

Hello!

I'm not sure if I'm doing something wrong, but I'm hitting an error while using the groups list operation from WorkspaceClient:

03/05/2023 04:00:10 PM  :::DEBUG::::   Loaded from environment
03/05/2023 04:00:10 PM  :::DEBUG::::   Attempting to configure auth: pat
03/05/2023 04:00:10 PM  :::DEBUG::::   Starting new HTTPS connection (1): xxxxxxxxxxxx.cloud.databricks.com:443
03/05/2023 04:00:12 PM  :::DEBUG::::   https://xxxxxxxxxxxx.cloud.databricks.com:443 "GET /api/2.0/preview/scim/v2/Groups?filter=displayName%2Beq%2Bmy_group HTTP/1.1" 400 None
03/05/2023 04:00:12 PM  :::DEBUG::::   GET /api/2.0/preview/scim/v2/Groups?filter=displayName+eq+my_group
< 400 Bad Request
< {
<   "detail": "Given filter operator is not supported.",
<   "schemas": [
<     "urn:ietf:params:scim:api:messages:2.0:Error"
<   ],
<   "scimType": "InvalidFilter",
<   "status": "400"
< }
Traceback (most recent call last):
  File "<redacted>/main.py", line 6, in <module>
    databricks_groups = databricks_workspace.groups.list(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/databricks/sdk/service/iam.py", line 1291, in list
    json = self._api.do('GET', '/api/2.0/preview/scim/v2/Groups', query=query)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/databricks/sdk/core.py", line 753, in do
    raise self._make_nicer_error(status_code=response.status_code, **payload) from None
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: ApiClient._make_nicer_error() missing 1 required positional argument: 'message'

If I try the exact same filter using cURL, for example, everything works fine.

This is the main.py:

from databricks.sdk import WorkspaceClient
from modules.config.log import logger

databricks_workspace = WorkspaceClient()

databricks_groups = databricks_workspace.groups.list(
    filter="displayName+eq+my_group"
)

I'm using the SDK version 0.1.2 and Python version 3.11.3.
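One observation from the debug log above: the + characters are percent-encoded as %2B, so the server receives literal plus signs rather than spaces. A filter written with real spaces (and the value quoted, per SCIM; whether the quotes are required here is untested) may be worth trying:

databricks_groups = databricks_workspace.groups.list(
    filter='displayName eq "my_group"'
)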

Make databricks-sdk work out of the box in notebooks

Right now two steps are needed to make the sdk work from notebooks:

  1. %pip install
  2. manually passing a token to the sdk by calling internal dbutils APIs to get one

So we don't need to solve step 1 right away, but maybe we can just bake the logic from step 2 into the SDK itself, just like e.g. mlflow works out of the box without passing a token around manually. Would be nice if you could just type w = WorkspaceClient() in a notebook!
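For reference, the manual step 2 workaround looks roughly like this today (it relies on internal, unsupported dbutils APIs, so treat it as a sketch):

from databricks.sdk import WorkspaceClient

# inside a Databricks notebook, where dbutils is available as a global
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
w = WorkspaceClient(host=ctx.apiUrl().get(), token=ctx.apiToken().get())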

[BUG] Storage Credentials create

WorkspaceClient.storage_credentials.create() requires a metastore_id which the API is not expecting. It throws the error 'CreateStorageCredential metastore_id can not be provided.' when metastore_id is provided, and 'StorageCredentialsAPI.create() missing 1 required positional argument: 'metastore_id'' when it is not.

Missing "not applicable" positional arguments for job task settings

With the following code that uses 0.0.1 of the SDK:

import os
from databricks.sdk import WorkspaceClient

host  = os.getenv('DATABRICKS_HOST')
token = os.getenv('DATABRICKS_TOKEN')

w = WorkspaceClient(host  = host, token = token, auth_type = "pat")

w.jobs.create(
  job_name = 'my-job',
  tasks = my_tasks
)

The my_tasks declaration doesn't seem to work no matter what syntax I use. For example:

Creating a list returns AttributeError: 'dict' object has no attribute 'as_dict':

my_tasks = [
  {
    "description": "My job.",
    "existing_cluster_id": "1128-232547-p64vrmx2",
    "notebook_task": {
      "notebook_path": "/Users/[email protected]/go-fakedata"
    },
    "task_key": "my-key"
  }
]

Creating a dictionary returns AttributeError: 'str' object has no attribute 'as_dict':

my_tasks = {
  "description": "My job.",
  "existing_cluster_id": "1128-232547-p64vrmx2",
  "notebook_task": {
    "notebook_path": "/Users/[email protected]/go-fakedata"
  },
  "task_key": "my-key"
}
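Since both errors come from the SDK calling as_dict() on plain Python values, the tasks argument presumably expects the jobs dataclasses rather than dicts; a hedged sketch built from the fields above (the notebook path is a placeholder):

from databricks.sdk.service.jobs import JobTaskSettings, NotebookTask, NotebookTaskSource

my_tasks = [
  JobTaskSettings(
    description = "My job.",
    existing_cluster_id = "1128-232547-p64vrmx2",
    notebook_task = NotebookTask(
      notebook_path = "/Users/<user>/go-fakedata",
      source = NotebookTaskSource("WORKSPACE"),
    ),
    task_key = "my-key",
  )
]

Note also that the earlier create example in this document passes the job name via name= rather than job_name=.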

jobs.list_runs does not return 'latest_repair_id'

Hello,

I wrote a script that can repair all the runs for a given job_id. However, when I have to repair for a second time, I come across this error:

w.jobs.repair_run(j.run_id, rerun_all_failed_tasks=True)
databricks.sdk.core.DatabricksError: The latest repair ID needs to be provided in order to create a new repair

I cannot find any output in the jobs objects that carries this information. Can you provide some insight here? Thanks!
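A hedged guess at where that value might live: the run details returned by jobs.get_run appear to carry a repair_history list, whose last entry's id could be passed as latest_repair_id (the field names here are assumptions):

run = w.jobs.get_run(j.run_id)
# pick up the most recent repair id, if the run has been repaired before
latest = run.repair_history[-1].id if run.repair_history else None
w.jobs.repair_run(j.run_id, rerun_all_failed_tasks=True, latest_repair_id=latest)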

Token Cache Support

Because of the many tools we own, it would be greatly beneficial to have a single way to store tokens locally, e.g. via keyring or an encrypted file.
