exasol / bucketfs-python

This project forked from exasol/bucketfs-utils-python


BucketFS utilities for the Python programming language

Home Page: https://exasol.github.io/bucketfs-python

License: MIT License

Shell 4.29% Python 95.48% Dockerfile 0.23%
bucketfs exasol-integration foundation-library python

bucketfs-python's Issues

Create a connection object for a bucket

Background

  • Connection objects are used in Exasol to store credentials or configuration for UDFs
  • We often need to supply BucketFS credentials and BucketFS locations to UDFs
  • It would be good to have a function which generates a connection object create statement from a BucketFSLocation
  • This connection object can then be used inside a UDF to create a BucketFSLocation object via the BucketFSFactory
  • Process: BucketFSLocation -> Connection Object Create Statement -> Connection Object in UDF -> BucketFSFactory in UDF -> BucketFSLocation in UDF

Acceptance Criteria

  • BucketFSLocation can create Connection object create statement
  • BucketFSFactory can create BucketFSLocation from this connection object
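
A minimal sketch of what such a generator could look like; the function name generate_connection_statement and its parameters are illustrative assumptions, not the actual library API:

# Sketch only: the helper name and its parameters are illustrative, not the
# actual bucketfs-python API. Exasol connection objects store an address plus
# credentials; a UDF can later read them via exa.get_connection(<name>).
def generate_connection_statement(name, url, user, password):
    return (
        f"CREATE OR REPLACE CONNECTION {name} "
        f"TO '{url}' USER '{user}' IDENTIFIED BY '{password}'"
    )

statement = generate_connection_statement(
    "MY_BUCKETFS", "http://localhost:2580/default/my/path", "w", "write-password"
)
print(statement)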

Create a path builder

Create a universal path builder function that will return a PathLike object. The function should take different sets of arguments for different file backends. The following backends should be supported.

  • On-prem BucketFS
  • SaaS BucketFS
  • BucketFS file system as seen by a UDF
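
A rough sketch of the idea, assuming a hypothetical build_path function whose backend names and keyword arguments are illustrative only; the /buckets/<service>/<bucket> mount point is how BucketFS appears inside a UDF:

# Sketch only: build_path, its backend names and keyword arguments are
# hypothetical and just illustrate the dispatching idea.
from pathlib import PurePosixPath

def build_path(backend, **kwargs):
    if backend == "onprem":
        # e.g. kwargs: url, bucket_name, username, password, path
        return PurePosixPath(kwargs["bucket_name"]) / kwargs["path"]
    if backend == "saas":
        # e.g. kwargs: url, account_id, database_id, pat, path
        return PurePosixPath(kwargs["database_id"]) / kwargs["path"]
    if backend == "mounted":
        # BucketFS as seen from inside a UDF, mounted under /buckets/<service>/<bucket>
        return PurePosixPath("/buckets") / kwargs["service"] / kwargs["bucket_name"] / kwargs["path"]
    raise ValueError(f"unknown backend: {backend}")

print(build_path("mounted", service="bfsdefault", bucket_name="default", path="models/model.pkl"))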

Create bucketfs via XMLRPC

Background

  • At the moment, the user can only use an existing bucketfs but can't create one
  • Creating buckets requires admin access to the cluster configuration, which happens currently via XMLRPC

Acceptance Criteria

  • Add a function which creates a bucketfs

🔧 Remove old BucketFs API and Package

Summary

Currently the exasol-bucketfs package contains two packages (exasol.bucketfs, exasol_bucketfs_utils_python) containing the new and the old API respectively. Once all dependencies on the old API are cut, the exasol_bucketfs_utils_python package and its respective tests should be removed.

References

Requires these issues to be solved first:

Task(s)

  • Remove old API package exasol_bucketfs_utils_python
  • Remove tests for exasol_bucketfs_utils_python package
  • Remove deprecated dependency
    • Joblib
    • Typeguard?

Compute hash sum by downloading from HTTP without persisting the downloaded file

Background:

  • In the past we often had problems with corrupted or wrongly uploaded files
  • Checking the checksum helped a lot in the past

Acceptance Criteria:

  • Implement a function which downloads a file via HTTP / HTTPS and computes the checksum on the fly
  • The checksum computation and the download should be streamed, to reduce the memory footprint
  • Different checksums should be usable, such as SHA512, SHA256, MD5, ...
  • The checksum should be compatible with the checksum you get from the command line tools sha512sum, sha256sum, md5sum
  • We want to avoid storing anything on disk
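
A minimal sketch of the streamed approach using the requests package and hashlib; the URL is a placeholder and the function name is not part of the library API:

# Sketch only: streams the download and hashes it chunk by chunk, so nothing
# is persisted to disk and the memory footprint stays small.
import hashlib
import requests

def checksum_via_http(url, algorithm="sha256", chunk_size=64 * 1024):
    digest = hashlib.new(algorithm)
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# The hex digest is comparable to the output of sha256sum / sha512sum / md5sum.
print(checksum_via_http("http://localhost:2580/default/some_file.bin", "sha256"))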

Download into byte string

Background

  • At the moment, we can only download to strings or files; however, sometimes you want to download binary data

Acceptance Criteria

  • We can download binary data from the BucketFS into a byte string without any encoding

Ability to create Bucket

In order to be able to integrate Exasol closely with ML pipelines, this feature would be highly helpful.

List buckets of a certain bucketfs

Background

  • Currently, the user needs to specify which bucket to use, but the user has no way to discover the bucket names

Acceptance Criteria

  • Add a function to list the names of the buckets of a bucketfs

Improve documentation

Summary

Make sure the documentation is easy to use (e.g. all parameters in the API documentation are shown properly).

Examples:
Good vs Bad

Tasks

  • Make sure everything is documented properly
  • Add an examples section
  • Make sure code snippets are run or taken as part of unit tests
    (Doctests may also be an option, if they can be displayed easily within the docs using Sphinx)
  • Adjust old references from bucketfs-utils-python to bucketfs-python
  • Add a PyPI-based installation guide

Resources

Add logging to bucketfs-python library

Summary

Currently, the bucketfs-python library lacks logging functionality, which makes it challenging for users to debug and trace errors effectively. Adding logging capabilities will enhance the usability of the library by providing valuable insights into its runtime behavior.

Proposed Solution

Integrate logging functionality into the bucketfs-python library to enable users to easily monitor and troubleshoot operations. This should include configurable logging levels and options to customize log output.

Expected Outcome

With logging incorporated, users will have improved visibility into the library's internal operations, making it easier to diagnose issues.

Additional Information

Note: Ensure that the logger is appropriately named and can be controlled/configured by the logging configuration of the library users.
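
A sketch of the conventional pattern, assuming the logger is named after the package (exasol.bucketfs) so that users can control it through their own logging configuration:

# Sketch only: standard library-logger pattern.
import logging

# Inside the library: name the logger after the package and attach a NullHandler
# so importing the library never emits "no handler" warnings.
logger = logging.getLogger("exasol.bucketfs")
logger.addHandler(logging.NullHandler())

# In user code: enable and tune the library's output.
logging.basicConfig(level=logging.INFO)
logging.getLogger("exasol.bucketfs").setLevel(logging.DEBUG)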

Compute hash sum for a file during upload

Background:

  • In the past we often had problems with corrupted or wrongly uploaded files
  • Checking the checksum helped a lot in the past

Acceptance Criteria:

  • You can enable this feature with an option, default is off
  • Add the checksum computation to the upload functions
  • The checksum computation should be streamed, to reduce the memory footprint
  • Different checksums should be usable, such as sha512, sha256, md5, ...
  • The checksum should be compatible with the checksum you get from the commandline tools sha512sum, sha256sum, md5sum
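
One possible approach, sketched here as a wrapper that hashes data while the uploader reads it; the class name and the commented upload call are illustrative assumptions:

# Sketch only: names are illustrative, not part of the bucketfs-python API.
import hashlib
import io

class HashingReader:
    """Wraps a binary file object and updates a digest on every read."""

    def __init__(self, fileobj, algorithm="sha256"):
        self._fileobj = fileobj
        self.digest = hashlib.new(algorithm)

    def read(self, size=-1):
        chunk = self._fileobj.read(size)
        self.digest.update(chunk)
        return chunk

reader = HashingReader(io.BytesIO(b"example payload"), "sha512")
# bucket.upload("path/in/bucket/data.bin", reader)  # hypothetical: the uploader would drain the reader
while reader.read(64 * 1024):  # drained here only to demonstrate the streamed hashing
    pass
print(reader.digest.hexdigest())  # comparable to the output of sha512sum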

🔧 Cleanup and migrate integration tests and setup

Summary

All important integration tests for the old API should be migrated to the new API and made part of the integration tests suite of the new API.

Tasks

  • Migrate and integrate UDF integration tests
  • Add pytest based Integration test settings/configuration
    • Buckets and DB settings used for integration tests
  • Add support for a pytest based setup of the integration tests (start db etc.)

๐Ÿž Uploading pickled model to BucketFS does not work

Summary

Uploading a pickled ... model to BucketFS fails with an exception.

Reproducing the Issue

Produce ML model

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Create a dummy dataset with 10,000 rows
np.random.seed(42)
data = {'X1': np.random.rand(10000),
        'X2': np.random.rand(10000),
        'X3': np.random.rand(10000),
        'y': np.random.rand(10000)}
df = pd.DataFrame(data)

# Split the data into features (X) and target variable (y)
X = df[['X1', 'X2', 'X3']]
y = df['y']

# Create a linear regression model and fit it to the data
model = LinearRegression().fit(X, y)

# Generate 2,000 rows for X_new
np.random.seed(42)
X_new = pd.DataFrame({'X1': np.random.rand(2000),
                      'X2': np.random.rand(2000),
                      'X3': np.random.rand(2000)})

# Predict on new data
y_pred = model.predict(X_new)

# Print the predicted values
print("Predicted values:")
print(y_pred)

# Calculate the mean squared error
y_pred_train = model.predict(X)
mse = mean_squared_error(y, y_pred_train)
print("Mean Squared Error:", mse)

Pickled model

import pickle
from sklearn.linear_model import LinearRegression

# Save the model (the fitted LinearRegression from the previous snippet) to a file
filename = 'dummy_linear_regression_model.sav'
pickle.dump(model, open(filename, 'wb'))

print("Model saved successfully.")

Failing code

import io
import pickle

from exasol.bucketfs import Service

URL = "http://localhost:2581"
CREDENTIALS = {"default": {"username": "w", "password": "BBiSzwGaD6X7zLcjfpcP0OdGA317JABg"}}

bucketfs = Service(URL, CREDENTIALS)
bucket = bucketfs["default"]

filename = 'dummy_linear_regression_model.sav'
loaded_model = pickle.load(open(filename, 'rb'))


# Upload bytes
data = loaded_model
bucket["dummy/dummy_linear_regression_model.sav"] = data

# Upload file like object
# file_like = io.BytesIO(loaded_model)
# bucket.upload("dummy/dummy_linear_regression_model.sav", file_like)

# bucket.upload("dummy/dummy_linear_regression_model.sav", loaded_model)

Expected Behavior

Uploading model is successful.

Actual Behavior

Uploading the model fails with an exception.

Root Cause (optional)

unknown


Reported by: @exa-eswar

Check if bucketfs is reachable

Background

  • Currently, we simply try to download a file without checking if the bucketfs is reachable

Acceptance Criteria

  • Add a function to check if a bucketfs is reachable

Update typeguard version

  • typeguard 3.0.0 leads to TypeError: typechecked() got an unexpected keyword argument 'always'
  • this error is temporarily handled in #58
  • remove the version restriction after this error is fixed

Add minimal CLI

In the past, numerous attempts have been made to simplify interaction with Exasol's BucketFS:

  • โœ”๏ธ bucketfs-python: Python, active, used for tests automation of various python projects.
  • โœ”๏ธ bucketfs-java: Java, used for tests automation of various java projects.
  • โ“ bucketfs-client: Java, limited functionality, currently not develeoped actively
  • โŒ bucketfs-explorer: Java, GUI application, archived, deprecated
    Currently contained in official documentation (see DOC-2221)
  • โŒ bucketfs-utils-python: depecated, superseeded by bucketfs-python
  • โ“ shell functions for bash based on CURL requests: only for power users, limited functionality, see below
function bucketfs-password() {
    if [ -z "$1" ]; then
       echo "usage: bucketfs-password <container>"
       return 1
    fi
    BUCKETFS_PASSWORD=$(
	docker exec -it $1 \
	       grep WritePass /exa/etc/EXAConf \
	    | sed -e "s/.* = //" \
	    | tr -d '\r' \
	    | base64 -d)
}

function bucketfs-upload() {
    if [ -z "$1" ]; then
       echo "usage: bucketfs-upload <file> [path/in/bucket-fs]"
       return 1
    fi
    if [ -z "$BUCKETFS_PASSWORD" ]; then
       echo "Please set environment variable BUCKETFS_PASSWORD"
       return 1
    fi
    A=2580 # port
    B=$(echo default/$2 | sed -e 's/\/$//') # path
    curl -v -X PUT -T $1 http://w:$BUCKETFS_PASSWORD@localhost:$A/$B/$1
}

Still, users struggle and need help from time to time. This ticket therefore requests the creation of a minimal CLI solution that is:

  • convenient and easy to use
  • acceptable in usability, with sufficient functionality and at least some guidance for inexperienced users
  • lightweight with minimal footprint and prerequisites, e.g. frameworks and installations

In summary, bucketfs-python seems to be the best candidate. It is already used for test automation of various Python projects. Issue #4 currently asks to enhance bucketfs-python to list the contents of a folder in BucketFS, which could be another building block towards providing a minimal CLI with very limited effort. A possible shape of such a CLI is sketched below.
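
A rough sketch of what a minimal CLI entry point might look like; the program name and sub-commands are assumptions, and the actual dispatch into bucketfs-python is left out:

# Sketch only: argparse skeleton for a minimal BucketFS CLI.
import argparse

def main():
    parser = argparse.ArgumentParser(prog="bfs", description="Minimal BucketFS client")
    sub = parser.add_subparsers(dest="command", required=True)

    upload = sub.add_parser("upload", help="upload a local file into a bucket")
    upload.add_argument("file")
    upload.add_argument("remote_path")

    listing = sub.add_parser("ls", help="list the files in a bucket")
    listing.add_argument("path", nargs="?", default="")

    args = parser.parse_args()
    print(args)  # placeholder: dispatch to bucketfs-python calls here

if __name__ == "__main__":
    main()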

📚 API DesignDoc: "DirectoryBucket"

Summary

Write an API specification for a "DirectoryBucket" (type name to be defined) for the bucketfs library.
The DirectoryBucket acts as a wrapper around a Bucket, targeted at a specific subdirectory, to facilitate object storage operations within that subdirectory context.

Goals

  • Simplify Path Management: Enable components to operate in isolated subdirectories without manual path tracking.
  • Enhance Error Handling: Reduce errors stemming from manual path management

Functionality

  • read(path: string): Read the contents of a file located at path within the subdirectory.
  • write(path: string, content: any): Write content to a file located at path within the subdirectory.
  • delete(path: string): Delete a file or directory located at path within the subdirectory.
  • files(): List all files in the current subdirectory.
  • directories(): List all direct subdirectories as DirectoryBucket instances.
  • join_path(*paths: string[]): Safely join multiple path segments, ensuring proper navigation within the subdirectory.

Note: Consider having the path operations in a path object which will/can be used by the DirectoryBucket.
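
A minimal sketch of one possible shape for the wrapper; the underlying Bucket methods used here (upload, download, delete, files) are assumptions about the API, not a specification:

# Sketch only: confines all operations to one subdirectory of a bucket.
import posixpath

class DirectoryBucket:
    def __init__(self, bucket, directory):
        self._bucket = bucket
        self._directory = directory.strip("/")

    def join_path(self, *parts):
        return posixpath.join(self._directory, *parts)

    def read(self, path):
        return self._bucket.download(self.join_path(path))

    def write(self, path, content):
        self._bucket.upload(self.join_path(path), content)

    def delete(self, path):
        self._bucket.delete(self.join_path(path))

    def files(self):
        prefix = self._directory + "/"
        return [name for name in self._bucket.files() if name.startswith(prefix)]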

Related Issues

Download returns IOBase stream

Background

  • At the moment, we can only download files completely from the BucketFS, but sometimes it would be useful to download them in chunks and work on these chunks

Acceptance Criteria

  • We provide a new function which returns an object derived from IOBase (i.e. with a file-like API) that returns the data through the read function. Seek and write are not implemented.
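
A sketch of how such an object could be built on top of io.RawIOBase, assuming the download yields an iterator of byte chunks; how those chunks are fetched from BucketFS is omitted:

# Sketch only: read-only stream backed by an iterator of byte chunks.
# Seek and write are not implemented and raise io.UnsupportedOperation.
import io

class ChunkStream(io.RawIOBase):
    def __init__(self, chunk_iterator):
        self._chunks = iter(chunk_iterator)
        self._buffer = b""

    def readable(self):
        return True

    def readinto(self, b):
        while not self._buffer:
            try:
                self._buffer = next(self._chunks)
            except StopIteration:
                return 0
        n = min(len(b), len(self._buffer))
        b[:n] = self._buffer[:n]
        self._buffer = self._buffer[n:]
        return n

stream = ChunkStream([b"hello ", b"world"])
print(io.BufferedReader(stream).read())  # b'hello world'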

✨ Add better credentials support to new BucketFs API

Summary

Be more explicit and secure on how credentials are used within the bucketfs api.

Replace the default dict-in-dict credentials mapping passed to the service with a more sophisticated credentials provider
which, for example, does not accidentally leak authentication information when printed. Additionally, make it more explicit
that credentials are mapped to specific buckets.

Details

  • Add Credential classes/objects
  • Credential classes/objects should not leak information when printed
  • Credential classes/objects support an explicit request for unsecure output
  • Add a more explicit data structure / class for the global credentials mapping/store

Examples / Ideas

Secure & Unsecure Output

credentials = Credentials(username='foo', password='bar')


>>> print(credentials)
Credentials(username: ****, password: ****)

>>> print(f'{credentials:unsecure}')
Credentials(username: foo, password: bar)

Global Credentials Store

store = CredentialStore(
      [
          BucketCredentials(bucket='default', username='user', password='pw'),
          BucketCredentials(bucket='myudfs', username='u', password='secret'),
          ...
     ]
)

store = CredentialStore(
      [
          { 'bucket': 'default', 'username': 'user', 'password': 'pw' },
          { 'bucket': 'myudfs', 'username': 'u', 'password': 'secret' },
          ...
     ]
)

store = credentials.Store(
      [
          credentials.Bucket(name='default', username='user', password='pw'),
          credentials.Bucket(name='myudfs', username='u', password='secret'),
          ...
     ]
)

New Usage

from exasol.bucketfs import Service
from exasol.bucketfs import credentials

URL = "http://127.0.0.1:1234/"
STORE = credentials.Store(
    credentials.Bucket('default', username='w', password='w')
)
bucketfs = Service(URL, STORE)

Notes

  • Printing can/should be implemented by implementing __str__, __format__ and __repr__
  • Consider creating a sub module for the credentials code
  • Keep support for old credential usage but discourage it
  • The Store constructor should support a set of Credentials or just a single one (for simple use cases)
  • Think about for which parameters keyword argument passing should be enforced (e.g. username, password?)
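
A small sketch of the masked-output idea from the notes above, implemented via __repr__ and __format__; the class is a proposal, not the final API:

# Sketch only: masked output by default, plain output only on explicit request.
from dataclasses import dataclass

@dataclass(frozen=True)
class Credentials:
    username: str
    password: str

    def __repr__(self):
        return "Credentials(username: ****, password: ****)"

    __str__ = __repr__

    def __format__(self, spec):
        if spec == "unsecure":
            return f"Credentials(username: {self.username}, password: {self.password})"
        return repr(self)

credentials = Credentials(username="foo", password="bar")
print(credentials)                 # masked
print(f"{credentials:unsecure}")   # explicit opt-in to plain output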

Tasks

  • Add support for improved credentials
  • Add unit and integration tests for this feature(s)
  • Update the documentation to use new (more obvious) API for passing the credentials

✨ Conditional Running for SaaS Tests in CI

Summary

Add a control mechanism in our Continuous Integration pipeline to selectively execute the SaaS tests. Ideally, these specific tests should only run under certain conditions such as when explicitly invoked, either by a triggering commit or a manual activation via the workflow. This enhancement is intended to streamline our CI process, reducing unnecessary testing cycles and improving overall efficiency.

Furthermore, we should run them for only one Python version.
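
One way to sketch this on the pytest side, assuming the CI workflow sets an environment variable (here RUN_SAAS_TESTS, an illustrative name) only when the SaaS tests are explicitly requested:

# Sketch only: SaaS tests become opt-in via an environment variable.
import os
import pytest

saas_tests = pytest.mark.skipif(
    os.environ.get("RUN_SAAS_TESTS") != "true",
    reason="SaaS tests only run when explicitly requested",
)

@saas_tests
def test_saas_upload():
    ...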

User guide and Examples

It would be beneficial for the users to have a User guide (like other Exasol public repos) along with a few working examples.

Add python tooling (linter, formatters, ...)

Background

Acceptance Criteria

Upload stream of byte chunks from generator

Background

  • Currently, we only provide upload for file objects and strings; both functions upload the whole content at once. You can't generate the content on the fly.
  • Python generators allow generating data and then yielding execution to the consumer, which in our case would be the upload. This allows alternating between generation and upload.

Acceptance Criteria

  • We have a new upload function which accepts generators of bytes objects.
  • Check if requests package already buffers the data to optimize the upload, otherwise implement the buffering
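
For reference, requests already accepts a generator as the request body and sends it with chunked transfer encoding; a minimal sketch with placeholder URL and credentials:

# Sketch only: streaming upload of generated chunks via requests.
import requests

def chunks():
    for i in range(3):
        yield f"chunk {i}\n".encode()

response = requests.put(
    "http://localhost:2580/default/generated.txt",  # placeholder bucket URL
    data=chunks(),
    auth=("w", "write-password"),                    # placeholder credentials
)
response.raise_for_status()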

🔧 Refactor `examples` for more clarity

Summary

To enhance the accessibility of the example files in our documentation folder, we should refactor them.
The current approach, which utilizes a comment marker to distinguish between basic and advanced examples within a single file, has proven to be somewhat unclear.

Proposed Changes:

  • Split the Existing Example Files: Divide the current example files into two separate files:
    • xxx_basic.py: This file will contain the basic usage example.
    • xxx_advanced.py: This file will focus on more complex scenarios.

The primary goal of this refactoring is to make the example files less confusing when viewed in isolation. By creating distinct files for basic and advanced examples, we aim to facilitate a better understanding and improve the overall user experience.

Add a release workflow

Acceptance Criteria

  • Documentation generation
  • Release to PyPI
  • Release to Github Release

Simplify bucketfs package and API

Why

  • Simplifies API usage and API documentation
  • Improves readability of client and library code
  • Hides unnecessary internals
  • Provide a more pythonic experience to users of the library
  • Reduce the overall noise in the module
  • More clearly separate the internals from the actual "End User API"

How

Add a new package for the new API and structure to the workspace

The new package can make use of the old package and API (it can start as a "shim") and then bit by bit
integrate the required functionality without breaking existing code.

Reduce repetition in naming whenever the context is also providing that information

e.g.:

from exasol_bucket_fs_utils_python.bucketfs_location import BucketFSLocation

vs.

from exasol.bucketfs import Bucket

Note ℹ️: If this would increase the size of a single module too much, this can also be achieved by re-exporting.

Create new API

Example API

# import required functions and classes
from exasol.bucketfs import (
    BucketFs,
    Bucket,
# conversions likely should be implemented as functions
# as_file()
# as_string()
# ...
    AsFile,
    AsString,
    AsFileObject,
    AsJoblibObject
)


# Create bucketfs accessor object
# Note: Reading the available buckets etc. from the bucketfs service does not require credentials
bucketfs = BucketFs(
    host='localhost',
    port=1234,
)

# Create just a bucket accessor
# Note: consider taking the bucketfs service as a parameter instead of the host/port.
bucket = Bucket(
    host='localhost',
    port=1234,
    username='readuser',
    password='readpw',
    bucket='bucket',
    ...
)

# Access buckets
for bucket in bucketfs:
    print(bucket)

# Retrieve a specific bucket
bucket = bucketfs['bucketname']

# Upload data to a bucket
file_on_bucket = bucket.upload(content="Some String Content")
file_on_bucket = bucket.upload(content="Some String Content", name="explicit_filename.txt")

with open('/some/file.txt', 'r') as f:
    file_on_bucket = bucket.upload(content=f)


# List files in a bucket
for file in bucket:
    print(file)

# Download data from a bucket
file_content = bucket['my_text_file.txt']
file_content = bucket.download('my_text_file.txt')

# Conversion helpers
file = AsFile(file_content, '/home/my_file.txt')
string = AsString(bucket['myfile.txt'], encoding='utf-8')
joblib_obj = AsJoblibObject(file_content)

# Delete data from a bucket
bucket.delete('my_text_file.txt')
del bucket['my_text_file.txt']

Remove old API

Once the migration is complete, the old API and package can be deleted and a new version can be released.
(One may consider transitional releases with support for the old and new API + deprecation warnings)

Compute hash for all files in bucket via http download

Background:

  • In the past we often had problems with corrupted or wrongly uploaded files
  • Checking the checksum helped a lot in the past

Acceptance Criteria:

  • Implement a function which computes the checksum for all files in the bucket via http/s downloads
  • The checksum computation should be streamed, to reduce the memory footprint
  • Potentially parallelize this to speed up the computation
  • Provide a parameter to set the nrOfCores used for parallel computation
  • Different checksums should be usable, such as sha512, sha256, md5, ...
  • The checksum should be compatible with the checksum you get from the commandline tools sha512sum, sha256sum, md5sum

Wrong assumption of a path in a bucket always having subdirectories

Summary

If a bucket URL points to the bucket root, then the line below fails:

            base_path_in_bucket = PurePosixPath(url_path.parts[2]).joinpath(
                *url_path.parts[3:]
            )

This is in BucketFSFactory.create_bucketfs_location in bucketfs_factory.py, at the time of writing in line 45.
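
A sketch of a possible guard for the bucket-root case, assuming url_path is the path component of the BucketFS URL of the form /<bucket>[/path/in/bucket]:

# Sketch only: handle a URL that points at the bucket root (no sub-path).
from pathlib import PurePosixPath

url_path = PurePosixPath("/default")  # URL points at the bucket root

if len(url_path.parts) > 2:
    base_path_in_bucket = PurePosixPath(url_path.parts[2]).joinpath(*url_path.parts[3:])
else:
    base_path_in_bucket = None  # no sub-path inside the bucket
print(base_path_in_bucket)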

Related Issues

๐Ÿž Performance Regression in `bucketfs-python` Compared to `curl` and Previous API

@ahsimb reports bucketfs-python to be multiple times slower than curl.

Summary

The new bucketfs-python API is significantly slower when transferring large files (multiple MBs/GBs) compared to using curl and the previous API version.

Reproducing the Issue

Reproducibility: always

Steps to reproduce the behavior:

  1. Use the new bucketfs-python API to upload a large file (several MBs or GBs).
import exasol.bucketfs as bfs  # type: ignore

bucketfs = bfs.Service(buckfs_url, buckfs_credentials)
bucket = bucketfs[bucket_name]
bucket.upload(bfs_file_name, pickle.dumps(object))
  2. Compare the upload time with that of curl and the older bucketfs-python API method.
    Old API:
exasol_bucketfs_utils_python.bucketfs_location.BucketFSLocation.upload_fileobj_to_bucketfs

Expected Behaviour

The new bucketfs-python API should offer comparable performance to the old API and ideally also to methods like curl.

Actual Behaviour

The upload process with the new API is significantly slower than using curl and the previous API version, affecting efficiency and throughput for large file transfers.

Upload from byte string

Background

  • At the moment, we can only upload strings or files; however, sometimes you want to upload binary data that doesn't come from a file

Acceptance Criteria

  • We can upload binary data from a byte string to the BucketFS without any encoding

Move language container fixtures to own repository

Copies of the language container fixtures currently exist in multiple repositories of ours. It would be preferable to move them to their own repository, so we only use one centralized version.

  • Create new repository
  • Move files
  • Make sure all Projects use the files from the repository

Check if file exists in bucket

Background

  • Currently, you can only hope that a file you want to download exists

Acceptance Criteria

  • Add a function to check if a file or directory exists in the bucketfs

Upload directory to bucket

Background

  • We can currently only upload files to a bucket
  • Uploading directories with a complex directory structure is very difficult with this functionality.
  • We need to add functionality to upload a directory to a bucket directly.

Method

  • Directories could be zipped before uploading, so that we can then upload the zipped file to the bucket (see the sketch below).
  • After completing the upload operation, we have to unzip the file and extract the directory.
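
A sketch of the zip-based approach; the upload call at the end is a hypothetical placeholder for the actual BucketFSLocation API:

# Sketch only: zip a local directory in memory, then upload the archive.
import io
import zipfile
from pathlib import Path

def zip_directory(directory):
    buffer = io.BytesIO()
    root = Path(directory)
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in root.rglob("*"):
            if file.is_file():
                archive.write(file, arcname=file.relative_to(root))
    buffer.seek(0)
    return buffer

archive = zip_directory("my_model_dir")  # placeholder local directory
# bucketfs_location.upload_fileobj_to_bucketfs(archive, "my_model_dir.zip")  # hypothetical upload call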

Acceptance Criteria

  • Add upload_directory to BucketFSLocation and LocalBucketFSLocation
  • Add unit and integration tests

๐Ÿž Generating and deploying multi version documentation fails

Summary

Generating and deploying multi version documentation does not work properly in all scenarios and workflows.

Reproducing the Issue

Scenario 1:

  • Re-run a workflow which was run successfully for a push on master/main

Scenario 2:

  • Run CI-CD workflow (triggered by tag push)

Expected Behavior

Entire multi version documentation gets built and deployed to GitHub pages.

Actual Behavior

Documentation build/workflow fails.

Root Cause (optional)

No clear single root cause identified yet.

Leads

There are two major issues which have been identified so far:

  1. The docs build expects a setup/structure which causes a broken rendering of the API docs; otherwise it fails with
    /tmp/tmpywesr85d/worktrees/worktree_source/doc/api.rst:4:toctree contains reference to nonexisting document 'api/exasol_bucketfs_utils_python'
    
  2. The sgpg tool expects different parameters depending on whether a tag or a branch is used as the source, therefore the unparameterized
    GitHub workflow won't work properly in all cases.

Related Issues (optional)

Restrict typeguard version

  • typeguard 3.0.0 leads to TypeError: typechecked() got an unexpected keyword argument 'always'
  • use it as typeguard = "^2.11.1"

Compute hash sum for a file during download

Background:

  • In the past we often had problems with corrupted or wrongly uploaded files
  • Checking the checksum helped a lot in the past

Acceptance Criteria:

  • You can enable this feature with an option, default is off
  • Add the computation of the checksum to the download function
  • The checksum computation should be streamed, to reduce the memory footprint
  • Different checksums should be usable, such as sha512, sha256, md5, ...
  • The checksum should be compatible with the checksum you get from the commandline tools sha512sum, sha256sum, md5sum

The SSL certificate verification control doesn't work

There is now the verify parameter in the constructors of both the Service and the Bucket classes. However, when the bucket is accessed through the service, which is the conventional way of getting to it, the verify parameter is not passed on from the service to the bucket:

        return {
            name: Bucket(
                name=name,
                service=self._url,
                username=self._authenticator[name]["username"],
                password=self._authenticator[name]["password"],
                service_name=self._service_name
                # note: no verify=... is forwarded here, so the Bucket falls back to its own default
            )
            for name in buckets
        }

✨ Add extra features to new BucketFs API

Summary

Add missing extra features to new BucketFs API to match needs of UDF use cases

Details

Location / BucketFsLocation

Implement a location "type" which provides the ability to operate on a subpath within a bucket.
Ask @tkilias for more details.

UDF path

Provide an API to deduce the BucketFS path within a UDF.

Task(s)

  • Add Location support
  • Add UDF path/url support

Add DirectoryBucket (Pathlike, BucketPath, ...) to new API

Background

  • The new API is object-oriented
  • It uses objects like the Service, Bucket, MappedBucket
  • A MappedBucket is in that sense an Adapter for a Bucket. This way we can use composition instead of inheritance for the implementation.
  • However, Bucket and MappedBucket require the user to specify the path from the root
  • For more complex usage scenarios where multiple components of an application need to store objects in the BucketFS, we want that they can do that independent of each other.
    • For example, each component could use its own subdirectory. However, managing the absolute paths to subdirectories manually is error-prone.
  • For that reason, we need a DirectoryBucket which gets a Bucket and a path to a subdirectory and writes and reads objects below the subdirectory

Acceptance Criteria

  • Implement DirectoryBucket
    • read
    • write
    • delete
    • files
    • directories # return direct subdirectories as DirectoryBucket
    • join_path
  • Implement unit tests
  • Implement integration tests

Compute hash for files in glob in bucket from the filesystem of the UDFs

Background:

  • In the past we often had problems with corrupted or wrongly uploaded files
  • Checking the checksum helped a lot in the past

Acceptance Criteria:

  • Implement a function which computes the checksum for all files in a glob in the bucket from the file system in the UDFs
  • The checksum computation should be streamed, to reduce the memory footprint
  • Potentially parallelize this to speed up the computation
  • Provide a parameter to set the nrOfCores used for parallel computation
  • Different checksums should be usable, such as sha512, sha256, md5, ...
  • The checksum should be compatible with the checksum you get from the commandline tools sha512sum, sha256sum, md5sum
  • Two parts: one part runs in the UDF, the other part creates a UDF and runs it, but assume the Python package is already installed in the used language alias/container

Create bucket via XMLRPC

Background

  • At the moment, the user can only use existing buckets but can't create one
  • Creating buckets requires admin access to the cluster configuration, which happens currently via XMLRPC

Use Cases

In order to be able to integrate Exasol closely with ML pipelines, this feature would be highly helpful.
-- @exa-eswar

Acceptance Criteria

  • Add a function which creates a bucket in an existing bucketfs
